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Preamble 

Probability and Statistics 



PROBABILITY THEORY is an entirely mathematical subject. Starting 
with the description of a problem, we apply the axioms and rules of the theory 
to solve well-specified problems. Our starting point is a Probability Model, i.e. a 
fully specified distribution P for a given variable (or variables, or process. . . ). 

STATISTICS (or Statistical Inference) can be regarded as the inverse of Prob- 
ability: 

STATISTICS = (PROBABILITY)"\ 

We do not start with a known probability distribution and explore its mathemat- 
ically determined properties. Instead, we start with DATA, regarded as having 
been generated from some unknown probability distribution P, and want to say 
something sensible about P (perhaps so as to be able to say something sensible 
about future variables also generated from P). 

PROBABILITY: known P =^ properties of (as yet unobserved) observables X 

STATISTICS: observed values x =^ properties of unknown generating distri- 
bution P 

Although the top route is entirely mathematical, the bottom route is not! In- 
deed, how to say something sensible about the future on the basis of past data 
constitutes the essentially insoluble philosophical problem of INDUCTION. 

So. . . Statistics is impossible! And this is reflected in the fact that there are nu- 
merous different warring schools of thought (frequentist, Bayesian, likelihood,. . . ) 
as to how to conduct statistical inference. 
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Chapter I 



Statistical Models 



1 Statistical experiment 

Although we may not know the distribution {probability model) P that generated 
our data, we may be wilhng to confine it to some specified statistical model: 
P G P. For most of this course we take this model to be parametric, i.e. of the 
form 

where 6 is a finite or countable set, or a Euclidean space. More general forms of 
V are non-parametric. 

We will thus preface any statistical analysis with our specification of the relevant 
(parametric) statistical experiment: 

Definition 1 Statistical experiment 

A statistical experiment, S = (X, X, 6, O, V) comprises the following ingredients: 

• A measurabl^ sample space X 

• A (typically observable) random variable X taking values in X 

• A parameter space 6 

• A (typically unobservable) parameter variabl^ G with possible values in G 

^We shall not fuss about the associated a-field of measurable sets. In applications, X will 
almost always be either discrete, with all sets measurable, or a subset of a Euclidean space, 
equipped with its Borel a-algebra. 

^The use of different symbols for the name (0) and a possible value {6) of the parameter 
is non-standard, but marks a useful distinction. Note that the symbol Q is commonly used in 
place of our B to denote the parameter space. 
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• For each ^ G 6, a fully specified distribution over X, describing uncer- 
tainty about X, given e = eE (We shall also use the notation P{A\Q = 6) 
or P{A I 6), interchangeably with Pe(A)). 

□ 

The labelled family of distributions V := {Pg : 6 E £)} constitutes the associated 
statistical model. Our task is to "draw inference" about O from observation of 
X. 

In this course we shall focus almost exclusively on data that can be considered 
as having arisen as independent and identically distributed observations from a 
common probability distribution. Although cases where these assumptions can 
not be made are both practically and theoretically important, we shall have quite 
enough to handle in the independent and identically distributed case. So hence- 
forth we assume independent and identically distributed models except where 
otherwise noted. 

In our applications, the complete observable will thus comprise n independent 
and identically distributed observables from a common distribution, li £ = 
(X, X, 6, 0, V) describes the experiment relating to a single observation, that re- 
lating to n independent and identically distributed observations X = (Xi, . . . , X„) 
will be := (X",X, G, e,P"); here := {Pg^ : 9 e G}, where PJ^ denotes the 
joint distribution of X when the Xi are independent and identically distributed 
as Pg. In this case it is obviously enough to specify the single-observation exper- 
iment. 

A note on densities: We don't need much measure theory as such, but the lan- 
guage of measure theory provides an elegant and unified way of describing things. 

In all the cases we shall consider, there will be a dominating measure yU on X such 
that each Pg is absolutely continuous with respect to /i {i.e., fi{A) = ^ -Pe(^) = 
0). Then by the Radon-Nikodym theorem there will, for each 6, exist a density 
function ^ = p{x \ 0) (or, pg{x)), unique p.p. [/i], such that, for measurable A, 
Pg{X e A) = J^p{x\e)dij{x). For all our examples and applications, we shall 
have either X C R^, or X discrete (finite or countable). Unless otherwise noted, 
in the former case we shall assume /i = Lebesgue measure, so using the standard 
definition of probability density function; and, in the latter, fi = counting mea- 
sure, when the "density" becomes the probability mass function. However, even 
in these cases it is sometimes helpful to consider other dominating measures. 

The density function p{x \ 9) is only defined up to a set of //-measure 0. It will 

•^To avoid silly complications, we shall assume that 9 ^ 6' ^ Pq ^ Pg' — in this case the 
parameter is said to be identifiable. 
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typically be posible to choose unique versions of these functions that are con- 
tinuous in both arguments. This will henceforth be assumed unless otherwise 
noted. 



Likelihood 

An important ingredient of many inference methods is the likelihood functioT^ 
generated by an observation. 



Definition 2 Given a statistical experiment £ = (X,X, G, 0,P), siippose we 
observe X = x. The associated likelihood function for 6 is the functioro L : 6 — )■ 
IR+ defined by: 

L{e) ocp{x\e). 

□ 



When L{6) > on 6, we often work with the log-likelihood function l{9) ~ 
logL(^) (where ~ denotes that the two sides may differ by an additive constant). 

The value 6 (if it exists) maximising L{6) (or equivalently l{0)) is the maximum 
likelihood estimate (MLE) of (for data X = x). 



2 Some parametric statistical models 

Useful statistical models include: 

NORMAL MODEL 

X = R. 

= (M, 1/) G e = R X R+. 

"Likelihood" was introduced by Ronald Fisher, who introduced this new term to make a 
clear distinction between the likelihood function p{x | •), for given x a function on the parameter 
space 6 (transforming as a scalar), and the probability density p{-\0), for given a (non-scalar) 
function on the sample space X. Unfortunately this careful distinction is in danger of being 
lost, with the term "likelihood" increasingly employed as if it were synonymous with "statistical 
model" . 

^More properly, "the" likelihood function is not a unique function, but an equivalence class 
of functions, regarded as equivalent if they are related by multiplication by a positive scaling 
factor. To be well-defined, any procedure based on "the" likelihood function must be unaffected 
by such scalar multiplication: this will be the case for all those we shall consider. 
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When (M, V") = (/i, v), X A^(/i, v), with density 

_i ( (x - 
p{x \ fi,v) = {2'Kv) 2 exp — < — 

We have E(X | fx, v) = /i, var(X \ fi,v) = v. 
We are also often interested in the submodels: 

NORMAL MEAN MODEL: M is unknown but V = Vq is known. 
NORMAL VARIANCE MODEL: V is unknown but M = /io is known. 

BINOMIAL MODEL 

X = {0,l,2,...,n} 

e = [0, 1] 

Under Pq^ X ~ B{n] 6'), with density (probability mass function) 

p{x\e)= • ^(1-^)—. 

xl [n — xjl 

We have E(X | 6) = nO, var(X | 9) = ne{l - 9). 

NEGATIVE BINOMIAL MODEL 

X = {0,1,2,...} 

e = [0, 1] 

Uner Pq, X ~ UB{k] 9), with density 



We have E(X | ^) = k9/{l - 9), var(X | 9) = k9/{l - 9) 

POISSON MODEL 

X = {0,1,...} 

e = R+ 

Under Pq, X V{9), with density 

p(x\9) = e-^9yx\ 

We have E{X \9) = 9, var(X \9) = 9. 

GAMMA MODEL 

X = R+ 

Q = {A,B)e<B = R+ xR+ 

Given (A, 5) = (a, 6), X ~ r(a, 6), with density 

p{x\a,b) = -^x^-'e-'^ 
r(a) 
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where r(-) is the Gamma function^ 

T(a) := / x'^-^e-'^dx 



[For integer a, T{a) = {a — 1)!, more generally r(a + 1) = ar{a).] 
We have E(X \a,b) = a/b, var(X \a,b) = a/b"^. 

We could also entertain the submodels obtained by fixing A or B. For 
the special case that A is fixed at 1, we obtain the EXPONENTIAL 
MODEL: X ~ £{b). For B fixed at |, and reparametrising by N = 2 A, 



we obtain the CHI-SQUARED MODEL: xl = ^ 



BETA MODEL 

X = (0,1) 

e = e e = R+ X R+ 

Given {A,B) = {a,b), X ~ I3{a,b), with density 

Pi^\<^,b) = ^^^^x^~\l-x) 

r(a)r(6) 



We have E{X \a,b) = a/ {a + 6), var(X \a,b) = ab/{{a + bY {a + b + 1)}. 

UNIFORM MODEL 

X = R 

Q = {A,B) 

G = {(a, 6) e R X R : a < 6} 

Given (A, 5) = (a, 6), X ~ U(a, 6), with density 

p{x I a, 6) = (6 — a)""*^ (a < a; < &) 

Note: For this model it is not possible to choose the density function to be 
a continuous function of all its arguments. 

CAUCHY LOCATION MODEL 

X = R 
G = R 

Under P^, X ~ C{9, 1), with density 

I ^) = - TTT' — 

7r \ + [x — oy 

The mean and variance of this distribution do not exist. 

^Do not confuse the Gamma function and the Gamma distribution. 
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3 Exponential families 

A particularly versatile and tractable general form for a statistical model is the 
exponential family (EF)0 

Definition 3 Let V = {Pg : ^ G 6} be a family of probability distibutions on 
the measure space (X, A, m), all dominated by m, with densities p{x \ 9) = j^ix) 
defined for x G X, 6^ e 6. 

We call V a (p-parameter) exponential family if we can express 

p{x I 6) = exp{a(x) + b{e) + u{eft{x)} (1) 
for some functions a : X ^ R, 6 : 6 ^ R, m : 6 ^ Rp, t : X ^ Rf. □ 

If we introduce the reference measure fi defined by fi{A) = J^exp{a{x)} dm{x), 
then the density with respect to fi will be exp{6(^) +u{6)'^t{x)}, i.e. we can take 
a(x) = 0. 

The transformed parameter $ := m(B) is termed the natural or canonical pa- 
rameter. By the Fisher-Neyman factorisation criterion, the statistic T := t{X) is 
sufficient El it is termed the natural sufficient or canonical statistic. Both $ and 
T take values in R^. 

Re-expressed in terms of $, we can write 

p{x\(f)) = exp{a{x) - k{(f)) + (f)^t{x)} (2) 

where necessarily 

gfcW = J exp{a(x) +0^t(x)}c/m(x). (3) 

It can be showrH that (with respect to an appropriate underlying measure u on 
R^) the marginal distribution, say, of T = t{X), when $ = 0, will have density 
q^{-) of the form: 

q^{t) = exp{a{t) - k{(P) + (f^t) (4) 

— although calculation of the function a may not be straightforward. In partic- 
ular, the family Q = {Q^} of distributions of T is itself an EF. 



^Take care not to confuse this with the exponential distribution, or with the exponential 
model (a family of exponential distributions). 

^Indeed, T is minimal sufficient (sec §0]of Chapter Hll below) whenever u{G) contains an 
open set. 

^See Corollary 13 of Chapter Ull below. 
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In order for ^ (and hence (jl))) to define a density, it is necessary and sufficient 
that k{(f)) given by be finite. The set # C [Rp for which this holds is the natural 
(or canonical) parameter space, and the EF is full when $ can take any value in 



Lemma 1 The natural parameter space # is convex. The function k is strictly 
convex on #. 

We shall always suppose that # contains an open set in W. This does not lose 
any generality. 

For, if not, let g < p be the dimension of the afHne span of <t>. We can reexpress $ S <!' as as 
an afEne function + c of a new parameter ^ 6 R', and thence in terms of >!' as natural 
parameter, and new natural statistic S = A^T e IR9. Convexity and full affine dimension of the 
new natural parameter space imply that it contains an open set in R'. 



The function k is very important, containing essentially all the information about 
the family Q. Nevertheless it does not appear to have a well-established name. 

The moment- generating function (MGF) : R*' — t- [R+ U {00} of is given by: 



So long as G #°, the interior of # — which we now assume — this will be finite 
for s in a neighbourhood of 0. 

The cumulant generating function (CGF) of is thus 



In view of this, we shall call k{-) the cumulant function of the EF. 

From general theory, knowledge of its (finite) MGF, or equivalently of its CGF, in 
a neighbourhood of completely determines a distribution. So for an EF, knowl- 
edge of /c in a neighbourhood of G #° completely determines the distribution 
of T when $ = </>. 

Further, the first two derivatives of at — or, equivalently, of at <^ — 
give the mean and variance (dispersion) of the associated distribution Q^. We 
can also show these properties directly. Differentiate dH} with respect to (pj (and 




exp{A:(0 + s) - A;(0)}. 



K^(s) 



log M^{s) 

k{<t> + s)-k{(t)). 
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assume that this can be passed through the integral, which is indeed the case for 
G We obtain: 

^Me'^W = f tj{x) exp{a(x) + ft{x)} dm{x). (5) 

Dividing this by Q yields: 

dk{(t)) 



tj{x)p{x I 0) dm{x) 
= ^4T,). (6) 

The mean value parameter H := E$(T) is an important transformation of $. 
From © , the values t] and of H and $ are related by 

V = VA;(0). (7) 

This transformation of $ into H is 1-to-l. The inverse transformation g, express- 
ing the natural parameter $ as a function g{li) of the mean- value parameter H, 
is known as the (canonical) link function of the EF. 

A further differentiation yields: for G 

^^ = cov,(T,,T,). (8) 

In particular, var0(Tj) = c}^/c(0)/50|. For p = 1 we thus have /c"(0) = Yai^{T) > 
0, in accordance with Lemma ^ 



3.1 Examples 

Example 1 Normal location model J\f(M, vq) 

Suppose that the variance V of the normal distribution is known to have value 
vq, the unknown parameter thus being just M. With X = R, we have density 
(with respect to Lebesgue measure): 

p{x\ix) = (2vrt;o)-^exp-i /(^^\ (9) 

2 1 I'o J 

Equation (jH)) can be written in the EF form (for p = 1), with: 

1 f 

«(^) = -o ^ log(27rt;o) H 

t{x) = X. 
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In particular the natural sufficient statistic is T = X (whence the mean-value 
parameter H is E{X \ M) = M), and the natural parameter is $ = M/vq. The 
cumulant function k is given by k{(j)) = ^VQ(j)'^, and we readily check k'{(f)) = 
= /i = E(X), = Wo = var(X). □ 

Example 2 Binomial model B{n; 11) 

Now X = {0, 1, . . . ,n}. Our parameter is 11 G [0, 1]. With respect to counting 
measure, the density is 

pix I vr) = • ,, vr- (1 - vr)— . (10) 
xl[n — x)l 

This can be written in the EF form with: 

a{x) = logi— -^^^^ — -\ 
lxl{n — xy. ) 

hiji) = n log(l — it) 
u{n) = log 
t{x) = X. 

Again the natural sufficient statistic is T = X, and so the mean- value parameter 
is H = nil. The natural parameter is $ = log{n/(l — H)} (whence H = ne*/(l-|- 
e*)). We have k{(f)) = ralog(l + e"^), and check: k'{(f)) = ne^/{l + e"^) = E(X), 
k"\<p) = ne^/{l + e^f = mr{l - tt) = var(X). □ 

3.2 Repeated observations 

Suppose (Xi) are independent and identically distributed from the EF with den- 
sity (d}. Then the joint density of X = {Xi, . . . , X^) (with respect to the product 
measure m" on X") is 

n 

p(xi0) = n^'^^*!'^) 

1=1 

= exp{a„(x) - /c„(0) + 0^t„(x)} (11) 

with a„(x) = Xir=i^(^i)' = Zir=i^(^i)' ^ni<P) = nk{(j)). Hence, as 

varies in #, the joint distributions of X again constitute an exponential family. 
The natural parameter $ is unchanged, the natural statistic is T = ^}^=it{Xi), 
and the mean-value parameter and the function k are both multiplied by n. 
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We note that, under repeated sampling, the dimensionahty of the sufficient statis- 
tic remains bounded (by p) as the sample-size n increases. Conversely, under 
certain condition^ this property will hold only for an EF. 



3.3 Exponential family likelihood 

Suppose, for the EF of (^, that we observe X = x. The associated likelihood 
function on 8 is 

Le{e) oc exp{6(^) + u{eft{x)} (12) 
In particular, in terms of the natural parameter $, the log-likelihood function is 

/ci,(0) ~ -A;(0) +0'^t(x). (13) 

The maximum likelihood estimate of $ (if it exists) is the maximiser of 
Since k is convex, Z$ is concave, and thus has at most one finite maximum on 
the convex set #. If exists and lies in the interior #° (which will often but not 
always be the case), differentiation of (I13|l shows that it will satisfy the likelihood 
equation: 

^ = t,(x) (j = l,...,p). (14) 
Conversely when fll4p has a solution in #, this will be the MLE. 

From we see that equation (fT^ is formed by equating the observed and 
the expected value of T. This directly delivers the MLE rj = t{x) of the mean 
value parameteter H; for any other parameter function, B = (?(H) say, we have 
6 = g{t{x)}. Whenever this yields a solution within the parameter-space, that 
will be the MLE. For a random sample of size n, we similarly obtain the MLE 
by equating the population and sample means of T. 

One measure of the precision of estimation is the curvature of the log-likelihood 
function at its maximum, as given by its negative second derivative. For the 
parameter $, this is just the second derivative of k, evaluated at 0. Since k"{(f)) = 
{d/d(f))E{T \(f)), this is large (and so the log-likelihood function is spiky) when 
E(T I (p) is changing rapidly in the neighbourhood of 0, which intuitively suggests 
that T is very informative about 0. 



For $ G the p-dimensional generalisation of k"{(j)) is the matrix {d'^k{(j))/d(j)iC 
This is a special case of the Fisher information matrix 

-{d^ hgp{x I e)/deidej}p{x \ e) d^i{x) 



— of which the most important is that the set of possible values for X does not depend 
on the value of 8 
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Chapter II 

Principles of Inference 



Consider a statistical experiment S = £x = (X, X, 6, 6, V). 



1 Sufficient statistic 

Definition 1 Let T be a measurable space, and t : X — ?■ T a known measurable 
function. The variable T := t{X) is a statistic. □ 

If, instead of learning X, I learn only T, I will generally have lost information. 
But sometimes such reduction of our data can be effected without loss of useful 
information about 0. 

Definition 2 Sufficient statistic 

A statistic T is (Fisher) sufficient for the statistical model V (or for its parameter 
0) if there exists a specification of the conditional distributions for the full data 
X, given T, that serves for each P G P. □ 

Example 1 Suppose, given A = A, the (Xj) are independent and identically 
distributed as V{X), with density (probability mass function) p{x \ A) = e'^A'^'/x! 
(x = 0, 1, . . .). Then T := J2i=i -^i ~ V{n\). Thus, when each Xj = 0, 1, . . . and 
t = X]i=l ^i'- 

nr=i Prob(X, = x,|A = A) 



Prob(X = x|T = t,A = A) 



Prob(T = 1 1 A = A) 
t\ 



nr=i ^^ 

Since this expression does not depend on A, the statistic T is sufficient for A. □ 

17 
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Example 2 Suppose X = where the (Xj) are independent and 

identically distributed with an arbitrary continuous distribution over X = IR. 
Consider the function t : IR" — which is the order statistic of the data, 
obtained by arranging the values (which, by continuity, can be assumed distinct) 
in increasing order — thus losing the information about their original order. It 
is intuitive, and can readily be shown, that the conditional distribution of X, 
given t(X) = t, is the discrete uniform distribution over the set of all n\ possible 
order ings of the given values t. Since this does not depend on the underlying 
distribution, the order statistic is sufficient for the statistical model comprising 
all continuous, independent and identically distributed probability models on R". 

□ 



2 Sufficiency principles 

The motivation for introducing sufficiency by means of Definition |2] is as follows. 
Let T be a statistic. We can consider the experiment £x generating the full 
observable X (probabilistically determined by the value of 0), as performed in 
two stages. In the first stage, we generate T from its marginal distribution (which 
will typically depend on the value of 6), thus performing the marginal experiment 
£t- Secondly, having obtained say T = t, we generate X from its conditional 
distribution, given T = t (and 6): thus performing the conditional experiment 

£x\T=t- 

If now T is sufficient, the probabilistic structure of the conditional experiment 
£x\T=t is fully determined, independently of whatever value 6 may take. It can 
thus be treated as generating "pure noise", and so entirely uninformative about 
0: indeed, we could convincingly mimic it ourselves using a randomising device 
such as a roulette wheel. This argument suggests that we can safely ignore the 
conditional experiment Sx\T=t-, for the purpose of inferences about 0, and base 
these entirely on the marginal experiment St- This in turn would mean that 
the sufficient statistic T must contain all the useful information about that is 
available from conducting the experiment Sx- 

This intuition is formalised in: 

STRONG SUFFICIENCY PRINCIPLE [SSP] 

If T = t(X) is a sufficient statistic, then the inference to be drawn from observing 
X = a; in Sx should be the same as that to be drawn from observing T = t[x) 
in the marginal experiment St- n 

More precisely, let X be any proposed inference method, intended to be applied 
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across a variety of experiments £: I{£x, x) is the inference X produces when the 
outcome of experiment £x is X = x. The we can say X respects SSP if, whenever 
T = t{X) is a sufficient statistic in £x, T{£x,x) = X{£T,t{x)). 

A weaker variant, referring only to the overall experiment £ = £x, is: 

WEAK SUFFICIENCY PRINCIPLE [WSP] 

If T = t{X) is a sufficient statistic, and t{x) = t{x'), then the inference to be 
drawn from observing X = x {in £) should be the same as that to be drawn from 
observing X = x' (in the same experiment £). □ 

That is to say, an inference method X respects WSP so long as X(£, x) = T{£, x') 
whenever t{x) = t{x') for any statistic T that is sufficient in £. Clearly 



SSP =^ WSP: Any inference method respecting the strong sufficiency 
principle will automatically respect the weak sufficency principle. 



We shall see later how these sufficiency principles apply to various specific infer- 
ence methods. 



Example 3 Applied to Example WSP requires that we should ignore the 
ordering of the data in our inferences: whatever inference we may make from 
some data-sequence {xi, . . . ,Xn), we should make the identical inference from 
a different data-sequence that differed from this only by a permutation of the 
elements. For example, use of the sample mean A„, or of the sample median, as 
an estimator would respect this principle. However, if we were to use Xn/2 we 
would be violating WSP. □ 



3 Sufficiency and likelihood 



For further analysis we will assume that there is an underlying, measure on 
such that each Pg has a density, say p{x \ 9), with respect to /iu 



^Note however that this does not hold for Example 12 if we allow — as we can — continuous 
distributions that are not absolutely continuous. 
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Theorem 1 (Fisher-Neyman factorisation criterion) The statistic T = t{X) 
is sufficient for if and only if we can express the density p{x \ 9) in the form 



Proof. A simple proof for a discrete sample space was given in STATISTICS IB. The following 
is a more general argument. For simplicity, we suppose that, for some 0o S ©i 



Let T := t(X) be the set of possible values for T. We note the following: 
(i). Assuming j^, the factorisation Q will hold if and only if, for all S S G, we can express 



(ii) . p(x\9) = p(x\do)b(x,9) if and only if, for any function / : X ^ R, ^e{fiX)} = 

Eg„{f{x)b{x,e)}. 

(iii) . (Kolmogorov's general definition of conditional expectation). Suppose E{y) exists (i.e. 

E{\Y\) < oo). Then E{Y \T) = k{T) if and only if, for any function fli(T) of T, 
E{<P{T)k{T)} = E{4>{T)Y}. 

(In both [(Ii)| and [(iii)| the equality is to be interpreted in the sense that whenever either side 
exists, so does the other, and they are equal). 

Suppose first that T is sufficient, with density say q{t \ 8) (with respect to some measure 
on T that we need not specify). Then implies q{t\8o) > 0, all t £ T. Define a{t,6) : = 

q{t\e)/q{t\eo). 

For arbitrary k(X) with finite expectation, let k{T) := Eg{k{X) \ T}, chosen not to depend 
on 9, as is possible by sufficiency. Then 





p(x I 6»o) > for all x e 



pix\e) =p{x\eo)a{tix),e}. 



(3) 



EeiKX)} 



EeMT)} 
Eg„{K{T)a{T, e)} 
Ee„{k{X)a{T,e)} 



since k(T) = Eg{k{X) \ T} 

by |(ii)| applied to the distributions of T 



on applying [(iii)] since k{T) = Eg^,{k{X) | T}. 



On using [(ii^ and [(i^ we see that holds. 

Conversely, suppose we have a factorisation as in Q, or cquivalcntly ^1 
arbitrary functions k{X), f(T), and let k(T) := Egg{k{X) \ T}. Then, using]^ 



holds. Consider 



Ee{/(r)K(T)} 



Ee„{f(T)KiT)a(T,e)} 
Eg^{f(T)k{X)a{T,e)} 
Eo{f{T)k{X)}. 



So, by|(iii)| k(T) := Eg{k{X) | T}. Since Eg{k{X) | T} is thus the same for all 9, for any function 
fc(X)B T is sufficient for 6. □ 



It is enough to consider all indicator functions of sets. 
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It readily follows that the natural sufficient statistic in an exponential family is 
indeed sufficient. 

Corollary 2 Suppose T = t{X) is sufficient for Q, and xi,X2 G X are such that 
t{xi) = t{x2)- Then, as functions of 9, p{xi \ 6) oc p^x^, \ 0). 

Thus Corollary El says that, whenever two possible outcomes of the same exper- 
iment yield the same value for some sufficient statistic, the likelihood functions 
(see Definition 121 of Chapter Qj) that they generate for B must be proportional. 

Corollary 3 Suppose T = t{X) is sufficient for Q, with density q{t \ 9) (with re- 
spect to some measure onl). Then, for any x G X, as functions of 9, q{t{x) \ 9} oc 
p{x I 9). 

Proof. Examining the proof of Theorem [Tl we sec that q{t{x) \ | 0o} = = 

p(x\e)/p{x\eo). □ 

That is to say, we will always obtain proportional likelihood functions, whether 
we observe the full data X oi a. sufficient reduction T. 



4 Minimal sufficiency 

In general there are many non-equivalent sufficient statistics. Thus suppose Xi ~ 
7\/'(G, 1) {i = 1, 2, 3). Then the following statistics are sufficient: Tq := Xi-\-X2 + 
X3; Tl := (Xi,X2 + X3); T2 := (X^ + X2,Xs); T3 := the order statistic of 
{Xi, X2, X3). Note that Tq is a function of each of the others, and intuitively it 
does not seem possible to find a non-trivial function of Tq that will be sufficient 
— i.e. Tq is (apparently) a minimal sufficient statistic. 

In general, for functions s, t on X, write s ^ t if S := s{X) is a function of 
T := t{X); that is, t{x) = t{x') =^ s{x) = s{x'). We also write S ^T. 

Lemma 4 Suppose T is sufficient, and T ^ S. Then S is sufficient. 

Proof. For compatible values x, s, t of X, S, T, p{x \ s,9) = p{x \ s, t, 9), which, 
being obtained by further conditioning p{x \ t, 9) on S, can be chosen to be in- 
dependent of the value 6* of O. [Alternatively, using the factorisation criterion: 
if p{x I 9) = h{x)g{t{x), 9}, this can also be written as p{x \ 9) = h{x)'-f{s{x), 9}]. 

□ 
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Intuitively, if all the usable information about is contained in T, it will be 
retained if we add further data to T. But T constitutes the more effective reduc- 
tion. 

Definition 3 A statistic T is called minimal sufficient if it is sufficient and, for 
any sufficient statistic S, T ^ S. □ 

If it exists, a minimal sufficient statistic (MSS) T is the most compact reduction 
of the original observable X that does not discard any useful information about 
G. It is essentially unique: if both T and T' are minimal sufficient, they are 
related by an invertible transformation, and so embody the same information. 

It is not obvious that a MSS need exist. However the following constructive 
argument shows that it does. 

Theorem 5 A minimal sufficient statistic T exists^ 

Proof. For x, x' G X, write x ~ if the likelihood functions based on observing, 
respectively, X = x and X = x', are proportional (as functions on 6). That is, 
for some c > 0, 

p{x I 9) = cp{x' I 6) (4) 
for all 6 E &. Note that c will usually depend on x and x': c = c(x, x'). 

It is easy to see that this is an equivalence relation on X. Within each equivalence 
class, select one representative point, let T be the set of all such representative 
points, and define t : X — > T by: t{x) is the representative of the equivalence class 
containing x. Then t{x) = t{x') <^=^ x ~ x'. 

By Theorem if S* = s{X) is sufficient, and s{x) = s{x'), then x ~ x', so t{x) = 
t{x'). Hence T < S. Moreover, since x ~ t{x), p{x\6) = c{x,t{x)) p{t{x) \ 9}. 
Hence, again by Theorem ^ T is itself sufficient. □ 



In applications we would not normally use this particular form of minimal suffi- 
cient statistic: any statistic T = t{X) such that t{x) = t{x') if and only x and x' 
yield proportional likelihoods will serve. 

Corollary 6 In order to respect WSP in an experiment Sx, it is necessary and 
sufficient that we draw the identical inference from observations x and x' G X 
whenever x and x' yield proportional likelihood functions for B (i.e., (0) holds). 

■^Our argument is not entirely rigorous for a non-discrctc sample space, since we do not fuss 
about measurability or sets of measure 0; but for all but pathological counter-examples these 
gaps can be filled by suitable technical elaborations. 



5. COMPLETENESS 
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5 Completeness 

The following technical definition, while not easy to motivate intuitively, proves 
to be of importance for a variety of theoretical developments. 

Definition 4 Given a statistical model V = {Pg : ^ G 6}, a statistic T (taking 
values in T say) is called complete if it admits no non-trivial unbiased estimator 
of 0: that is, whenever /i : T — )■ IR satisfies 

Ee{h{T)} = (all 6 e B) (5) 

then h{T) = (almost surely for any Pg). 

A slightly weaker property is bounded completeness, in which the above is only 
required under the additional condition that the function h be bounded. □ 

If T is [boundedly] complete and also sufficient for P, it is called (unsurprisingly) 
[boundedlyj complete sufficient. 

Example 4 Exponential model 

Given A = A G R"^, Xj ~ £'(A), independent and identically distributed. Then 
T := XliLi -^i sufficient for A. The distribution of T, given A = A, is T{n, A), 
with density r^^"~^e~'^*. To show that T is complete, let : [R+ — )■ [R be such 
that Exh{T) = for all A > 0. That is, 

POO 

/ /i(t)r-^e-^*rft = o, 

Jo 

all A > 0. Let h+(t) := max{h(t), 0}, h-(t) := max{— /i(t), 0} be the positive and 
negative parts of h. Then 

/•OO /"OO 

/ /i+(t)r-^e-^*rft = / /i_(t)r-^e-^*rft, 
Jo Jo 

all A > 0. By the uniqueness of Laplace transforms, we must have h^{t)t"'~^ = 
h^{t)t^~^ (almost everywhere), i.e. h{t) = (almost everywhere). □ 

The above argument can be generalized, to show: 

Theorem 7 If T is the natural sufficient statistic of a p-parameter exponential 
family, and the parameter space Q contains an open set in W, then T is complete. 
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Lemma 8 IfTis boundedly complete sufficient, it is minimal sufficient. 



Proof. Let S be minimal sufficient; it is thus a function of T. Consider a bounded real 
function h{T) of T, and define k{S) = Eg{h{T) \ S} (defined, as is possible by sufficiency of S, 
to be independent of the underlying parameter-value 9). Then, for any 9 £ ©, Eg{h{T)} = 
Ee[Ee{/i{T) | S}] = \Eg{k{S)}, so h{T) — k{S) is an unbiased estimator of which is a function 
of T. By completeness, we deduce that h{T) = k{S), so that h{T) is a function of S. Since this 
holds for every bounded real function h(T), T is a function of S, and hence minimal sufficient. 

□ 



In particular, under the conditions of Theorem |7J the natural sufficient statistic 
of an exponential family will be minimal sufficient. 



6 The likelihood principle 



Here is another putative "Principle of Statistics" : 
LIKELIHOOD PRINCIPLE (LP) 

Let £ = {I.,X,e,Q,V),S' = (X', X', 6, 6, P') be two possibly different exper- 
iments governed by the same parameter 0. Denote the sampling densities by 
p{- 1 6) for £, and p'{- \ 6) for S' . 

Suppose that, for a specific pair of values x e X and x' G X', the corresponding 
likelihood functions are proportional: 

p'{x' I 6) oc p{x I 6) (6) 

where the constant of proportionality may depend on x and x', but not on 9. 

Then we should make the same inference about 6, whether we observe X = x m 
S, or X' = x' in S'. □ 



That is to say, a general inference method X respects LP if X(£,x) = X{£',x')) 
whenever (jH)) holds. 

By virtue of Corollary IHl WSP is equivalent to the Restricted Likelihood Principle 
(RLP), in which LP is only applied to the case £ = £' , X = X'. Similarly SSP 
is equivalent to the special case in which £ = £x and £' = £t, the marginal 
experiment of £x for a sufficient statistic T. In particular, we see that 

LP =^ SSP: Any inference method respecting the likelihood princi- 
ple will automatically respect the strong (and hence also the weak) 
sufficency principle. 



6. THE LIKELIHOOD PRINCIPLE 
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How much more broadly should we regard LP as appropriate? Do you agree with 
its application in the following case? 

Example 5 Suppose £ specifies ~ V(toQ), while £' specifies T ~ T{nQ,Q) 
(tQ,nQ fixed). Note that the sample spaces are quite different, being discrete for 
£ and continuous for £' . The likelihood function based on observing = no in 
£ is proportional to e"*"^^"", as also is that based on observing T = to in £' . 
Hence LP tells us to make identical inference about in both these cases. □ 



6.1 Optional stopping 

Suppose that variables Xi,X2, ■ ■ ., with a joint distribution depending on 0, arrive in sequence. 
At any time i, after having observed the values of X* := {Xi, . . . , Xt), you can choose either to 
stop observing, or to continue. This decision may depend on the data-values observed so far, 
and may even involve some randomisation; but we assume that it is not further influenced by 
the value of © — such a stopping rule is termed non-informative^ Let s(x^) be the probability 
you will stop observing, after having observed X* = (xi, . . . ,xt), and c(x*) ;= 1 — the 
probability of continuing. Each choice for the function s specifies a different statistical experi- 
ment, £s say, but all are governed by the same parameter © (and hence, necessarily, share the 
same value 6 of 0). 

In £[i, the probability-density of observing exactly Xi = xi, . . . , Xt = xt, when © = 6, is 

Vs{x^\e) = c(fli)p(xi\e)cixi)p(x2\xi,e) ... c{x'-i)p(x* |a:*-\9)s(a:*) 

t 

oc Ylp(x,\x'-\e) 

= P(^'\9)- 

Since this holds for any non-informative stopping rule s, any inference method that respects LP 
must make the same inference about © from the observation X* = x*, no matter which stopping 
rule gave rise to it. 



Example 6 (Sec also Question 3(b) of Example Sheet 1.) Suppose that we are told that there 
were 9 heads in 12 tosses of a coin, but not which (non-informative) stopping rule was applied. 
It might have been: "Toss exactly 12 times" (a binomial experiment), or "Toss until the 3rd tail 
appears" (a negative binomial experiment). Or perhaps "Toss 10 times, and then stop after the 
first head" . All of these different experiments have different sample spaces, and would support 
different unbiased estimators for — so the method of unbiased estimation violates LP. In fact, 
pretty well every frequentist method violates LP in this setting. However, according to LP we 
should not care which was experiment was actually performed, but make the identical inference 
in all these cases. □ 

Example|21can be regarded as a continuous-time variant of the above set-up, where we observe 
a Poisson process of rate 0, either for a given time to, or until we have seen a given number no 
of events. 



For an example of informative stopping, consider an experimenter who tossed the coin 20 times, 
but chooses to reveal only the first (say) 14 outcomes, because that maximises the proportion of 
heads. 
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7 Ancillarity 

A statistic 5" = s{X) is called ancillary for if the distribution of S given = 9 
is the same for all ^ G 6. 

Example 7 A fair coin is tossed. If it lands H, X is generated from A/'(0, 10^), 
if T then from A/'(6, 50^). The outcome of the coin-toss is ancillary. □ 

Example 8 Xi,X2, . . . are independent with a joint distribution depending on 
0. An integer TV with known distribution is generated, e.g. by a randomising 
device. If = n, we observe Xi, . . . Then the sample size A^ is ancillary. 

□ 



Example 9 Xi, X2 are independent and identically distributed with the Cauchy 
distribution C(0, 1). Then S := X2 — Xi is ancillary. □ 

Again, we can consider the experiment £ generating the data X (probabilistically 
determined by the value of 9), as comprising two stages: In the first stage, we 
conduct experiment £5, generating S from its marginal distribution — which by 
ancillarity will not depend on the value of G. Then, having obtained S" = s, we 
perform experiment Sx\s=s, generating X from its conditional distribution given 
S = s and . This time we can argue that it is the first stage that is entirely 
uninformative about 6, and thus that our inferences should be based entirely on 
the second stage, the conditional experiment. 

This intuition is formalised in the 

ANCILLARITY PRINCIPLE [AP] 

If s = S{X) is an ancillary statistic, then the inference to be drawn from observing 
X = X in S should be the same as that to be drawn from observing X = x in 
the conditional experiment Sx\s=s{x)- D 

More precisely, we can say that a general inference method X respects AP if 
X{£,x) = Z{£x\s=s{x),x), for any ancillary statistic S = s{X) in £. 

When S is ancillary, we have the density factorisation p{x \ 9) = p{s)p{x \ s, 9) 
[s = s{x)). In particular, p{x \ 9) oc p{x \ s,9), so that we obtain essentially the 
same likelihood function, based on data X = x, whether we form it from the 
overall experiment £ or the conditional experiment £x\s=s- Consequently: 



8. BIRNBAUM'S THEOREM 
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LP =^ AP: Any inference method respecting the hkehhood principle 
(including Bayesian inference) will automatically respect the ancillar- 
ity principle. 

What other ways are there of respecting AP? 

Theorem 9 (Basu) Suppose that, in an experiment £, T is bounded complete 
sufficient for Q, and S is ancillary for 9. Then 5* _LL T | 0. 

Proof. Let k{S) be a bounded function of S, and g{T) := E{k{S) \ T} (independent of the 
value 6 of ©, by sufficiency of T). Then Eg{g{T)} = E{k{S), the latter being independent of 6 
by ancillarity. By bounded completeness, , almost surely for any 9 we have g{T) = E{k{S), i.e. 
Eg{k{S) I T} = Eg{k{S)}. Since k was arbitrary, the result follows. □ 



Corollary 10 Suppose that, in every experiment £ under consideration, there 
exists a bounded complete sufficient statistic; and let I be an inference method 
that, in any such £, depends only the distribution and value of that statistic. 
Then I respects AP. 

Unfortunately the hypothesis of Corollary El is very restrictive, so that it does 
not provide a useful way of ensuring AP is respected. 

Another approach — dual to basing inference on a minimal sufficient statistic — is 
to look for a maximal ancillary statistic S*, with the property that S ^ S* for any 
other ancillary statistic S. In the presence of such S*, we could operate reduction 
by ancillarity, which, starting from a given inference method X, creates a new 
method X* by applying X to the model conditioned on 5**. Then X* would satisfy 
AP. However, in contrast to the situation with sufficiency, a maximal ancillary 
typically does not exist. 

Example 10 Let (X, Y) be bivariate normal with means 0, variances 1, and 
unknown correlation B. Then each of X, y is separately ancillary, so the only 
possible choice of a maximal ancillary would be {X, Y) — which is clearly not 
ancillary! □ 



8 Birnbaum's theorem 

We have seen that the likelihood principle implies both the (strong or weak) 
suficiency principle and the ancillarity principle. Conversely, Birnbaum showed 
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that the only general way to respect both WSP and AP is to respect LP. In fact 
we do not even need the full force of AP, but only the weaker: 

CONDITIONALITY PRINCIPLE [CP] 

Let £i = (Xi, Xi, 6, 6, Vi), £2 = (X2, X2, 6, 6, V2) be two experiments governed 
by the same parameter 0, and let the compound experiment £ be conducted by 
first flipping a fair coin, with outcomes labelled 1 and 2; and then doing £i if 
the outcome is i. Then the inference drawn from observing outcome [i, Xi) in the 
overall experiment £ should be the same as that drawn from observing outcome 
Xi in the component experiment £i. □ 

More precisely, a general inference method X respects CP ifX(5, (z, Xj)) = X(£j, Xi). 

Theorem 11 (Birnbaum) 

WSP & CP ^ LP 

That is to say, if I is a general inference method respecting both WSP and CP, 
then it must respect LP. 

Proof. Let pi{- \ 9) be the density for in £i when = 6*. Then in £ the 
outcome {i,Xi) has probability/density p{i,Xi \ 6) = 

Suppose pi{xi I 9) oc ^2(2^2 I 6*)- Then in £ the outcomes (l,a;i) and (2,X2) have 
proportional likelihoods, and so have the same value for the minimal sufficient 
statistic. Hence, applying WSP, X(£, = X(£, (2,X2)). Also, by CP, 

X{£, {i,Xi)) = X{£i,Xi). We deduce X(£i,X2) = X{£2,X2); so X respects LP. 

□ 

This mathematically trivial theorem caused a great deal of consternation when 
first announced, since while most statisticians believed in WSP and CP, they 
could not accept LP. It is fair to say that its full force has still not been properly 
appreciated. 



Chapter III 

Frequentist Estimation Theory 



The frequentist approach to inference is indirect: it focuses on constructing and 
evaluating inference procedures of various kinds, constructed before any data are 
gathered and intended to be apphed to whatever data are subsequently observed. 
It replaces statistical inference from the observed data by probabilistic properties 
of procedures. These are assessed and compared by means of various more-or-less 
appealling criteria for the "goodness" of a procedure. Once a favoured procedure 
has been identified, it can be applied to the data observed to produce the relevant 
inference — in the hope that any good properties it may have as a procedure are 
somehow retained in its application to the specific data to hand. 



1 Frequentist evaluation of estimators 

We suppose specified a statistical experiment = with observable X G /""111 

One common task of statistical inference is to come up with a point estimate of 
the unknown parameter 6, or some chosen function of it (the estimand) , in the 
light of the observed data x. In addition, given that we do not expect to hit the 
nail exactly on the head, we would like some idea as to how close we might be. 
A purely likelihood-based approach might suggest using (say) the MLE and the 
curvature of the observed log-likelihood function there. The frequentist approach 
needs to take a step back, and consider what might be done ahead of gathering 
the data. 

Definition 1 Estimator 

An estimator is a function 6' : X" — 6. We also use the term to refer to the 
"'^Much of the description and analysis applies just as well to more general experiments. 
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random variable O := 0(X.). □ 



Once we have decided to use this estimator, our estimate, for observed data x, is 
^(x). 

The estimator G is a function of X, and so, for each possible value 6 of 0, has 
a distribution that can in principle be calculated from the distribution Pg of X. 
A frequentist bases her evaluation of the proposed estimator on this family of 
distributions for G given 6. But to do this she needs to apply specific criteria for 
such an evaluation. There is plenty of scope for developing and analysing (and 
criticising) "reasonable" evaluation criteria. 

Here is one possible criterion, for the case that G C R0 
Definition 2 Unbiasedness 

The bias function of the estimator of is the function 6 : 6 — )■ [R given by: 

b{e) = E(0 \e)-e. 

is an unbiased estimator of if its bias is identically 0, i.e., for any 6 E &, 

E(0 \e) = e. 

□ 



When we need to distinguish this bias function from that of other estimators, we 
might write b^i^-), or &(■),. . . 

When unbiasedness holds, the strong law of large numbers implies that, if we were 
to apply the estimator repeatedly to independent repetitions of the experiment 
£nU ci-ll with the same parameter- value, then the overall average of our estimates 
would converge to the true parameter value, with probability 1, no matter what 
that true value might be. This is the "repeated sampling" interpretation of the 
unbiasedness criterion. 

While unbiasedness may seem like a natural and appealing property, it has its 
problems. 

Example 1 Exponential model 

We have density p{x \ A) = Xe~^''^ [x > 0). We might be interested in the 
expectation M = of X. Consider the estimator M of M taken to be the 



^This readily extends to 6 C R'^, or any other vector space. 
■^Note that there are two nested levels of repetition involved here. 
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sample mean X := n ^ -^i- '^^^ distribution of X, when A = A, is r(?7,,nA), 
with expectation n/nX = fi. So M is an unbiased estimator of M. 

Suppose instead our estimand were A. It seems sensible to take A := M as an 
estimator of A = M"""^. However, E(M | A) = n\/{n — 1) ^ A, so that this is not 
an unbiased estimator of A. (The modified transformed estimator {n — l)/nA 
would be unbiased for A). □ 

Example 2 Likelihood principle 

Consider two experiments: S = {X, X, G, 9, V}, and £' = {X', X', G, 9, V'}. In 
S, given 9 = X ~ B{10, 6), with 

Prob(X = x\9)= (1 - ey^-^ (x = 0, . . . , 10). 

xl (10 — x)l 

In given 9 = ^, X' ~ A/'i3(7, 6), with 

Prob(X' = x'\9) = i^l±3l 0-' (1 _ ey (x' = 0,1,...). 

6! x'l 

The (unique) unbiased estimator of 9 in £ is T := X/10; while the (unique) 
unbiased estimator of 9 in £' is T' := X'/{X' + 6). 

Consider now the observations: X = 3 in £, and X' = 3 in £' . Unbiased esti- 
mation yield different answers, respectively 3/10 and 3/9. But both observations 
yield essentially the same likelihood function, proportional to ^'^(1 — Oy in each 
case. So the method of unbiased estimation does not satisfy the likelihood prin- 
ciple. □ 

A more basic criticism is that the property of unbiasedness says nothing about 
how close the estimator might be to its estimand. In Example^ one simple 
unbiased estimator of M is just the first observation, Xi; but this seems a poor 
choice. 

To address this issue we might proceed as follows. 
Definition 3 Mean squared error 

The mean squared error (MSE) function of an estimator 9 of 9 is the function 
mse : G — )■ IR given by: 

mse(^) = E{(9-^)^|^}. 

□ 
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When is unbiased, mse(6') is just the (sampling) variance v{6) := var(0 | 6) of 
0. More generally we have 

mse(^) = v{e) + b{ey. 

If we have two suggested estimators, and 0*, of the same parameter 0, such 
that inse(6') < mse*(^) for all 6, then we might choose to use in preference to 
0*. 

Example 3 In Example d the variance function of M = X is v{fi) = fi^/n. If 
instead we used M* = Xi, we would get v*{fi) = fi^. Since both estimators are 
unbiased for M, these are also the mean squared error functions. Assuming n > 1, 
nise(/i) < mse*(/x) for all n, so on this criterion we would prefer the estimator M 
to M*. 

Consider now a third (biased) estimator of M: := nX/(n + 1). This has 
bias function b^{fi) = —fi/{n + 1), and variance function = nfi'^/{n + 1)^, 

hence mse"'"(/i) = /U^/(n + 1), which is everywhere less than fnse(/i) = /i^/n. So 
this biased estimator beats the more obvious unbiased estimator M on the MSE 
criterion. □ 

Yet another (crazy??) estimator of M is M° = 3, i.e. we ignore the data entirely 
and always estimate M as 3. Although heavily biased (fc°(/i) = 3 — //), it does 
at least have the advantage of small variance, f°(/i) = ! And indeed it is NOT 
beaten by (say) M, since when /i = 3 we have mse° = 0, whereas inse > 0. 
Of course, for fi far from 3, mse°(/i) = h°{^Y will be much larger than fnse(/i). 
But since the inequality in mse(/i) goes in both directions as yU varies, the MSE 
criterion can not select either of these estimators as being superior to the other — 
they are simply incomparable. 

In an attempt to evade such incomparabilities it is common to restrict attention to 
unbiased estimators only, which can then be compared in terms of their variance 
functions — though this is pretty arbitrary, and has the undesirable side-effect of 
excluding good biased estimators such as M^ above. 

Definition 4 Minimum variance unbiased estimator 

An estimator of is a minimum variance unbiased estimator (MVUE) if: 

• is unbiased for 

• for all 6' G 6, v{0) < v*{6) for every other unbiased estimator 0* of 0. 



2. TOWARDS A VARIANCE BOUND 
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□ 



Theorem 1 If a MVUE of B exists, it is unique. 



Proof. We use the readily verified property 

var |^(Ti + T2)| + var |i(Ti - ^2)! = ^ {var(TO + var(T2)} . (1) 



Suppose Ti and T2 are both MVUEs of 0. It immediately follows that they have 
the same variance function, v{9) say. Also, \{Ti + T2) is an unbiased estimator of 
G, so \axe{\{Ti + T2)} > v{6). Substituting in ((1} we find var6){|(Ti — T2)} < 0, 
hence vare(Ti - T2) = 0. Also, ^e{Ti - T2) = 9 - 9 = 0. Thus Ti = T2 with 
probability 1 under any Pe. □ 



2 Towards a variance bound 

How do we identify a MVUE (if one exists), or check that a proposed estimator 
is in fact MVUE? One way is to show the existence of a lower bound v{9) for the 
variance function v{9) of any unbiased estimator. Then if we find v{9) = v_{9) we 
shall know that we have a MVUE. 

We now develop some theory working towards identifying such a bound. We 
suppose a statistical experiment £„ = with 6 C [R. We further suppose 
p{x I 6') > on X, and introduce, for x e X", 9 e <Q: 

U^,9) :=logp(x|^). 

In manipulating functions such as /n(x, 9) we shall be taking integrals with respect 
to x (using the dominating measure /i" over X"), and taking derivatives with 
respect to ^ G 6 (which we indicate with a dash: '). We shall suppose that 
the functions treated are suficiently regular to allow these operations, as well as 
interchange of the order in which they are applied. 

The random variable Un{9) := /^(X, 6') (both the form and the distribution of 
which depend on the value 9 of the parameter) is called the (Fisher) score function 
or variable. 
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Lemma 2 Under condition^ allowing the interchange of integration over X and 
differentiation on 6, 



Ee{U^,e)} = 0. 

M-C(x,^)} = E4{/;(x,^)r]. 



(2) 
(3) 



Proof. Property (j21) is readily proved by differentiating the normalisation 
property 



p(x| 9) rf^(x) = 1 



(4) 



with respect to 6, and switching the order of integration and differentiation (which 
will be valid under suitable regularity conditions). 



For 0, we have 

Ee{-i:i^,9)} = Ee 



/(X 



p(X| 



Ip(x| 



The first term has expectation 0, by an argument similar to that used to show 
(jSJ. The second term is the right-hand side of dHJ- □ 



3 Fisher information 

Definition 5 The (Fisher) information function for estimation of 6 from X in 
the given model is: 

h{e) :=M-C(x,e)}. 

□ 

From (j21) and Q we also have 

I„(0)=var4C(X,0)}. (5) 

Thus, under Pg, the score variable Un{0) has expectation and variance In{d)- 

*The most important is that the range of x-values for which the integrand is positive should 
be the same for aU 6* S (B. 



3. FISHER INFORMATION 
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3.1 Repeated sampling 

For this course we are assuming the Xi independent and identically distributed 
according to Pg. Then, with 

l{x, 9) := \ogp{x I 9) 

we have 

n 

logp(xi^) = lognp(^ii^) 

n 
i=l 

whence 

n 

Un{9)=Y,<0) 

i=l 

where Ui{9) is the score variable based on the single outcome variable Xj. Thus, 
under Pe, Un{9) is a sum of n independent and identically distributed components, 
each with mean and variance i{9) := Eg{— /"(X, 6')}, the Fisher information 
based on a single observation. In particular, in this independent and identically 
distributed case, In{9) = ni{9). By the central limit theorem, the asymptotic 
distribution under Pg of the standardised score variable, Sn{9) := {In{9)}~2Un{9), 
as n — > oo, is standard normal. 

In particular, for large n we can test a hypothesis O = 6'o by referring S'„(6'o) 
to standard normal tables; essentially equivalent, we can refer 5'„(6'o)^ (which is 
called the score statisti^ to tables of Xi- This yields the score test. 

Similarly, we can form an approximate 95% (say) confidence interval for as: 

{9 : \{I^{9)}-"Wnm < 1-96}. (6) 

3.2 Change of variable 

Transformation of data Suppose Y is a smooth invertible function of X: Y = 
/(X). If we always use Lebesgue as the dominating measure, the densities 
are related by: PYiy I 9) = Px(/~^(y) I 9) 7(y), where 7(y) is the Jacobian 
of the transformation /^^ at y. A similar relationship holds more generally, 
for any choices of the dominating measures, with some adjustment factor 



""not to be confused with the score variablel 
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7(y) that does not depend on the parameter. Hence 1'y{Y,9) = /^(X, 6'): 
i.e. the score random variable is the same, whether based on X or Y. In 
particular, the Fisher information, which is the variance of the score, is 
unaffected by a data transformation. 

Transformation of parameter Now consider a smooth invertible parameter 

tranformation: $ = g{Q). Then p<i> (x | 0) = pe{x \ g~^{(j))), whence dl<s,{x,(j))/d(f) = 
{g'{9)}^^dle{x,9)/d9 (where = g{9)). We deduce the following relation- 
ships between the score variables and information functions before and after 
a parameter transformation: 

= {g'i9)}-'Uei9) (7) 
= {g'{9)}-'U9). (8) 



These can be expressed more symmetrically and mnemonically as: 

U^{(j))d(f) = Ue{9)d9 (9) 
lMid<py = Ie{e){d9y. (10) 

In particular, Ii^{(f))~^U,^{(f)) = I0{9)~^Uq{9): the standardised score variable 
(and hence also the score statistic) is unaffected by an invertible transformation 
of either the data or the parameter. 



3.3 Fisher information for the exponential family 

Consider a 1-parameter EF, with natural parameter $ and cumulant function 
k{(j)). So 

/(x,0) = ~k{(j)) + (j)t{x) 
/'(x,0) = -k'{(j)) + t{x) 
U^{4>) = l'{X,<f))=T-k'{<j)) 
-/"(x,0) = k"{<p) 

2$(0) = E4-/"(X,0)} = fc"(0) 
/$((/.) = nk"{(f)). 

The general theory confirms our earlier results that E(^(T) = k'{(f)), va.T^{T) = 
k"{(j)). 

Consider now the mean-value parameter H = A;'($). Using we find iuiv) = 
{k"{(j))}-^ (where t] = A;'(0)), and so I^iv) = n{k"{(l))}-^ 



4. THE CRAMER-RAO LOWER BOUND 

4 The Cramer-Rao lower bound 
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Here — at last — is a reason to care about the Fisher information. 

Theorem 3 (Cramer-Rao inequality) Let be an unbiased estimator ofQ. 
Then (assuming it is legitimate to interchange integration over X and differenti- 
ation on 6 — see Footnote EJj its variance function v{6) satisfies: 
For all e ee, 

v{6) > i{dr\ (11) 

Proof. Recall the Cauchy-Schwarz inequality: 

{cov{Y, Z)Y < var(F) var(Z), (12) 

which holds for any variables Y, Z, having any joint distribution (so long as the 
variances exist); moreover, we get equality in ()12|1 if and only if there is a linear 
relationship between Y and Z. 

Fixing an arbitrary value of 6, we will apply fll2|l to the variables Y = Q and 
Z = U{9), using the distribution Pg. 

Using (0), 

cov^e, f/(e)} = Ee{0{X)l'^{X,e)} 

= / e{x)p'{x\9)dfx{x) 
d f 

^ d9 J ^(^)P(^I ^) ^/^(^) 
= 1 

(where we have use the unbiasedness of 6: EfiijO} = 6.) Inserting this, and (jS)), 
into dH)), the result follows. □ 



Corollary 4 Suppose that 6 is an unbiased estimator ofQ whose variance func- 
tion V satisfies v\e) = I{ey^ . Then 6 is the MVUE ofQ. 
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Example 4 Exponential model 

Consider tlie exponential model parametrised by M = E(X | A) = A~^, so that 
p{x I /i) = exp{—x/ fi}, and so 

l{x,fi) = — log II — x/ji 
—l"{x,iJ?j = — + 2x/i~^ 



Hence any unbiased estimator of M must have variance at least /(/i) ^ = /i^/ 



n. 



But var(X | fi) = fi"^, so the variance of the unbiased estimator X of M is fi^/n. 
It follows that this is the MVUE of M. □ 



4.1 Achieving the bound 

An estimator whose variance function achieves the Cramer-Rao lower bound is 
necessarily a MVUE. However the converse is false in general, since the bound is 
not necessarily achieved. When does this happen? 

Suppose then that an unbiased estimator = ^(X) of some parameter O has 
variance function achieving the Cramer- Rao lower bound. The linearity condition 
following fll2|) states that equality in fjll|) holds if and only if there exist coeffi- 
cients B and W, possibly depending on 6, such that (with probability 1 under 
Pg"), /^(X, e) = B + We. Thus I'^i^, e) must have the form B{e) + W{e) ^(x), 
and so p(x | 9) = exp /n(x, 6) must have the form 

p(x I 6) = exp{a(x) + b{e) + w{e) ^(x)}. 

We thus see that, for equality in we must be dealing with an exponential 
family of distributions for X. In the independent and identically distributed case 
we confine ourselves to, this in turn requires that the basic single-observation 
model p{x \ 9) be an EE. 

Moreover, our estimator must be the natural statistic T. Since we are requiring 
that be an unbiased estimator of 0, that in turn implies that must be the 
mean- value parameter H of the EE. So we can achieve equality in flll|) if and only 
if we are estimating the mean-value parameter in an exponential family; and to 
do so we must be using the natural sufficient statistic. 

Example 5 Suppose Xi ~ ^(0), X2 ~ P(0^), independently. Then 

l{x,9) ~ -9 -6"^ + {xi + 2x2)\og9 



5. FISHER INFORMATION: MULTIPARAMETER CASE 
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u{e) = i\x,e) = -1-29 + ^''^^^' 



-i"{x,e) = 2 + 



9 

Xi + 2X2 
9^ 



This model forms an exponential family, with natural parameter $ = logG, 
natural statistic T = Xi + 2X2, and mean-value parameter H = O + 20^. 

Note that ^e{U{9)} = 0. Also /©(^) = Ee{-l"iX,9)} = 4 + 9-\ so the Cramer- 
Rao lower bound for an unbiased estimator of G is 9/{l + 49). The estimator 
O := Xi is clearly unbiased for 6, with variance function v{9) = 9. This is close 
to the CRLB when 9 is small. 

Now consider the parameter A = O^. Using (jH}, or directly, we find the CRLB 
for unbiased estimation of A is /a(A)^^ = 49^/ (1 + 49). The unbiased estimator 
X2 of A comes close to achieving this for large 9. 

Finally, if we just happened to be interested in the mean-value parameter H = 
6 + 26^, we would find the unbiased estimator T = Xi + 2X2 exactly achieves 
the variance bound I-^{ri)~^ = 9 + 49"^. □ 



5 Fisher information: multiparameter case 

Suppose now the parameter © = (0i, . . . , ©p)'^ S © C RP. For a function / : G — > R, we define 
the first derivative, f'{0), of / as the (p X 1) vector of partial derivatives with respect to each 
element of 0: 

/ df{0)/de^ \ 
f{e) ^ ■ 

In particular, the score variable, Un(^) = now a (p X 1) random vector. 

We define the second derivative, f"{0), as the symmetric (p X p) matrix of mixed second 
partial derivatives of / with respect to two elements of 0: 



f"{0) 



f d^f{0)/del 



V d^fi0)/depd9i 



d^f{0)/de^dep \ 
d^f(0)/del I 



In particular, the information function ln{0) = ^Eg{l"(X,0)} is now a symmetric (p X p) 
matrix. 

In parallel with the single-parameter results, we can show that the score vector Un{0) has 
(under Pg ) expectation and dispersion matrix I„{0). Letting i(0) ^ denote any (non-random) 
matrix square root of i(0) (i.e. a non-random matrix A{d) such that A{0)A{0)'^ = i{0)), and 
ln{0)^ = n2 i(0)2, the asymptotic distribution of ln(0)~^ Un{0) is that of p independent 
standard normal variables. In particular, the asymptotic distribution of UniO)^ In{0)~^Un{0) 
(which is invariant under transformations of either data or parameter) is Xp- 

Now suppose ©1 is an unbiased estimator of ©i . Similar to before, we find covgj©! , Uj (0)} = 
1 for j = 1,0 otherwise. It follows that, for any non-random vector c (which may however depend 
on 0) we have 



g ©1,C^U(0) 
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Applying Cauchy-Schwarz, wo get 

varg(ei) > cf/c'^/(0)c. (13) 

The right-hand side of lliji is maximised when c oc I{0)~^ 5i, where di is the (p X 1) vector 
whose first entry is 1 and all others arc 0. The maximised value is then I^^{9) = I{0)~^Si, 
the (1, 1) entry of 7(0)^^0 Hence any unbiased estimator of 0i must have variance at least 



6 Sufficiency and optimal estimation 

If S is any statistic, and T is sufficient, then we can form the conditional ex- 
pectation of S given T, S* := E{S \ T), which will be a function of T and, by 
sufficiency, will not depend on which value of 6 is used in its construction. It is 
thus another statistic. 

The following result from STATISTICS IB enables us to improve (in MSE terms) 
on any estimator that is not already a function of a sufficient statistic: 

Theorem 5 (Rao-Blackwell) Let T be sufficient, and let S* := E{S\T). Then: 
(t). Ee{S*) = Eg{S). 

(a). Yaxgi^S*) < vaie{S), with equality if and only if S is a function ofT (so that 
S* = S). 



Proof. The following formulae hold for any variables S,T with any joint distribution: 

E{S) = E{E{S|T)} (14) 
var(5) = E{var(S|T)} -f var{E(5|T)}. (15) 

Apply these to the case in hand, under distribution Pg, using Eg(S\T) = S* . Then [(i)] follows 
directl y fro m 1141 . Also, s ince the first term on the right-hand side of I15i is non- negative, we 
obtain [{iTyi For equality in |(ii)| we must have Ee{varg(S | T)} = 0, which implies varg(S' | T) = 0, 
i.e. S is fully determined by T. □ 



Corollary Q If a MVUE of G exists, it is a function of any sufficient statistic. 

6 Beware: This is not the same as the inverse of IniO), the (1, 1) entry of I{0)\ that would be the 
C-R bound for the case that the values of all the other parameters wre fixed and known. In fact 
I^^iO) > /ii(0)^^: we lose precision by not knowing the other parameters. 



6. SUFFICIENCY AND OPTIMAL ESTIMATION 
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It follows in particular that a MVUE (if it exists) must be a function of the 
minimal sufficient statistic. Conversely, if there is a unique unbiased estimator 
based on the minimal sufficient statistic, this must be the MVUE. 

Example 6 In Example |S] we had Xi ~ V{Q), X2 ~ V{Q^), forming an expo- 
nential family with sufficient statistic T = Xi + 2X2 "Rao-Blackwellisation" of 
the unbiased estimator X2 of 0^ produces the improved (and in fact MVUE — see 
below) estimator k{T) = E{X2 \ T), where, explicitly, 

^ E{(x-l)!(t-2x)!}-^ 

with the sum being over x an integer between 1 and t/2 inclusive. The MVUE 
of e is E(Xi|T) = T-2k{T). 

In neither case does the variance of the estimator based on T achieve the Cramer- 
Rao lower bound, since we are not estimating the mean-value parameter. □ 



6.1 Completeness 

The following result follows easily from the definition of completeness (Definition |3] 
of Chapter |n]) and the Rao-Blackwell theorem. 

Theorem 7 (Lehmann-Scheffe) Suppose T is a complete sufficient statistic. 
If an unbiased estimator of exists, there is a unique such estimator based on 
T, and this is the MVUE. 

Example 7 Exponential model 

We showed in Example 0] that T := 'Yl^=i^i is complete sufficient. We also 
saw in Example Q that (for n > 1) A* = (n — 1)/T is an unbiased estimator of 
A — which we now see depends only on T. By completeness. A* is the unique 
unbiased estimator of A based on T. If now A° is any unbiased estimator of A, 
then we must have E(A° | T) = A*. Then, by Rao-Blackwell, var(A°) > var(A*), 
and so A* is the MVUE of A. □ 

Note that for n = 1 there is no unbiased estimator, and hence no MVUE, of A. 

Example 8 Let X ~ 'P(A), and consider the parameter B = e~^^. It is easily 
checked that the statistic T = (—1)"^ is an unbiased estimatoJl of 6. Moreover 

^Note that the marginal distribution of T is not Poisson. 

^Strictly speaking, T is not a proper estimator, since it takes values outside the parameter- 
space. If we exclude it from consideration, then there is no unbiased estimator of Q. 
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X itself is complete, so T is the UMVUE of 0. As an estimator T thus has all 
the "desirable" frequentist properties we have so far introduced. Nevertheless we 
would probably never want to use the estimate it supplies, for any data x. □ 



Chapter IV 
Asymptotics 



Often we can not analyse exactly the behaviour of some statistical procedure, 
but we can say something about its asymptotic behaviour as the sample size 
n —7- oo — with luck, this will be a useful guide to what happens for finite n. In 
order for this to make sense, we must conceive of our procedure as being defined 
for arbitrary sample size — so we really have a sequence of experiments, {Sn)i and 
a sequence of procedures, one for each experiment. Thus we might e.g. consider 
the behaviour of an estimator sequence (9„) as n — )■ oo, where 6„ : X" — >■ 6. 
Here we restrict our attention to such estimation problems. 

There are in fact two possible set-ups for such asymptotics: 



(i). We consider a sequence of experiments, governed by the same pa- 

rameter 0, £n having sample space X". There is no required relationship 
between values generated in different experiments. 



(ii). We conceive of an infinite sequence of variables, (Xi,X2, . . .), with a joint 
distribution governed by 6 (in our applications, the (Xj) are independent 
and identically distributed). Experiment £„ consists in the observation of 
the initial subsequence (Xi, . . . In this case the actual value of (say) 

Xi would necessarily be the same in all the experiments. 



For simplicity, unless otherwise stated we restrict attention to the case of a real 
parameter: 6 C R. 
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1 Forms of asymptotics 

Since we are dealing with random variables, not numbers, we have to take care 
over what we mean by asymptotic behaviour. In particular, we distinguish the 
following ways in which a sequence (Z„) of rea0 random variables can "converge" 
to another real random variable Z: 



Convergence in distribution 

We say Z„ tends to Z in distribution, and write Z„ -4 Z, if, for any bounded 
continuous function /i : [R — > R, E{/i(Z„)} — )■ E{h{Z)}. 

A useful equivalent condition relates to the distribution functions Fn{z) := 
Prob(Z„ < z), F{z) := Prob(Z < z): we require Fn{z) — )■ F{z) whenever 
Prob(Z = z) = 0. 

Note that for this definition to be meaningful, we only need to specify the 
marginal distributions of the Z„ and Z — they do not even need to be defined 
on the same probability space. In particular, there need be no convergence 
between the values of Z„ and Z. 

Weak convergence We say Zn tends to Z in probability , or weakly, and write 
Zn Z , if Zn — Z (where denotes the "random" variable which is 
always equal to 0). 

An equivalent condition is: for each e > 0, Prob(|Z„ — Z| > e) — )■ 0. 

For this definition to be meaningful, Zn and Z must be defined on the same 
probability space. 

Strong convergence We say Z„ tends to Z almost surely ("a.s."), or strongly, 
and write Zn — )■ Z, if Prob(Z„ Z) = 1. This again requires all the 
variables to be defined on the same probability space. 

It can be shown that strong convergence =^ weak convergence =^ convergence in 
distribution. 

Lemma 1 (Slutsky) Suppose Yn Y and ZnAceR. Let g : Rx R ^ R be 

continuous. Then g{Yn, Zn) -4 g{Y, c). 



Extensions to variables taking values in a vector space are straightforward 



2. CONSISTENCY 
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2 Consistency 

Definition 1 An estimator sequence (T„) is (asymptotically) consistent for a 
parameter-function $ = 0(0) if, for eacli ^ G 6, T„ — j- (f){6) under Pg. We speak 
of weak [resp., strong] consistency if this convergence is weak [resp., strong] (the 
latter only making sense under the asymptotic set-up |(ii)|) □ 



If Tn is consistent for $, then, for a (bounded continuous) / on f(Tn) is 
consistent for /($). Sinc^ it is possible to find a consistent estimator sequence 
for the full parameter 6 — and hence for any parameter-function $ — it would be 
unwise to use an inconsistent estimator sequence, since for large sample size this 
would not have the achievable property of being close to the estimand 



Example 1 Let $ = E{X \ 6) (assumed to exist), T„ = X„ := Y17=i -^i- "^^^ 
weak law of large numbers shows that, under Pg, Tn — )■ E(X | 9) in probability. 
Thus (T„) is weakly consistent for <I>. 

In the asymptotic set-up |(ii)[ we can similarly call on the strong law of large 
numbers to show that (T„) is strongly consistent for $. 

Note that these results would apply equally if, in £„, we used (say) T„ = Xn/2, 
so ignoring half the data. □ 



Example 2 Let $ be the median of the distribution of X, so that Prob(X < 
0) = |ll We suppose that the distribution function F of X is continuous and 
strictly increasing, but is otherwise arbitrary. For simplicity suppose n odd, 
and let T„ = Xn := -^((n+i)/2) be the sample median. Now let Yi := F{Xi). 
Then the (Yj) are independent and identically distributed ?7[0, 1] variables, and 
so Yn ~ f3{{n + l)/2, {n + l)/2} (see Example Sheet 2, Question 9): in particular, 
Yn 4 |. But Xn = F-\Yn), and so T„ 4 F"H|) = ^-e- T^ is weakly 
consistent for $. □ 



Example 3 Let Xi ~ ^^(A), and consider T„ := nmin{Xj : i = 1, . . . ,n}. Then 
Tn ~ i^(A) for all n. In particular T„ is unbiased for M = 1/A for each n, but the 
sequence (T„) is not consistent for M. □ 



^At any rate (for the independent and identically distributed case) whenever Q is identified, 
in the sense that different values for always give rise to different distributions for X. 
■^For simplicity we omit explicit mention of the underlying parameter-value 0. 
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2.1 Consistency of the MLE 



Under very broad conditions (essentially, in the independent and identically dis- 
tributed case, that the parameter be identified), the maximum likelihood esti- 
mator is strongly consistent. The key property to note is: 

Lemma 2 For all 0' ^ 6 e 6, 

Ee{\ogp{X \e')} < Ee{\ogp{X \e)}. 



Proof. Jensen's inequality states that, for any convex function g, and any 
random variable Y such that the expectations exist, 

E{g{Y)} > g{E{Y)}; (1) 

For by convexity there exist a, b such that g{y) > a + by, all y, with equality at 
y = E(y). Now take expectations of both sides. If, moreover, g is strictly convex, 
we will have equality in if and only if Y is almost surely constant. 

Apply with g = —log, and Y = p{X \6')/p{X \6) if the denominator is 
positive, else 0. Take expectations under Pg. Since we assume the parameter B 
is identifiable, the functions p{- \ 9) and p{- \ 9') are not identically equal, so Y is 
non-constant, so the inequality is strict. Thus 

E,{logp(X I 9') - \ogp{X \9)} = E41og(F)} < log Ee{Y). 

But, with := {x : p{x \ 9) > 0}, 

Eg(Y) = j {^1^} ^^(^1^)^^ = j P{x\9')dx< J p{x\9')dx = l. 

□ 



Corollary 3 For any 9' ^ 9, with probability 1 under Pg there exists N such 
that, for all n> N, ln{9') < ln{9). 

Proof. Since, for any 6*', ln{9') = Y17=i^'^SP{^i\(^')j the summands are 
independent and identically distributed, the strong law of large numbers ensures 
that, with Pg-probability 1, 

n-'{ln{9') - ln{9)} -> E41ogp(X I 9') - \ogp{X \ 9)} < 0. 

□ 



3. ASYMPTOTIC NORMALITY 
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In particular, for any 5 > 0, with probability 1 under Pg there exists N such that, 
for all n > N, both /„(6' — 6) < ln{0) and ln{0 + 5) < ln{6)] and when this occurs 
there must be a local maximum of the function /,„(■) in the interval [9 — 5,9 + 5). 

Corollary 4 For any 9 G <Q, there exists a local maximiser 6„ of the likelihood 
function such that 

Qn ^' 9 under Pe. (2) 

For cases (for example, in an EF) when we know that ln{9) has just one maxi- 
mum 9„, Corollary m ensures that the sequence (9„) will be strongly consistent. 
Otherwise, this property will hold so long as we make a suitable choice of local 
maximumo In the sequel 6„ will refer to such a choice. 



3 Asymptotic normality 

We showed in ^ l3.1l of Chapter lllll that. when = 6', as n — )■ oo, the distribution 
of the standardised score variable 

Sr.{9) =: {Inm~'^Un{9) (3) 

(which, we recall, does not depend on how the data or the parameter are ex- 
pressed) is asymptotically A/'(0, 1). The score test of Hq : Q = 9q is executed 
by calculating the observed value of S'„(6'o)^ = {Ini^o)}'^ {Un{9o)}'^ and referring 
this value to tables of the Xi distribution. 

From this, other asymptotically equivalent results can be derived from the prop- 
erty that, for large n, the log- likelihood function ln{9) is approximately quadratic 
in a region of size 0{n~2) around the MLE 9n, and that values outside such a 
region are essentially ignorable. To start, we work at a non-rigorous level. 

Let 

Ln(6n) = snp Ln{9) 
6»ee 

log Ln. 

(In particular, sup0L„(6') = 1, supg I„(6') = 0.) 

* — though it may not be obvious which choice is "suitable", nor does Corollary 0] guarantee 
that the same choice will work for all 6. We shall ignore these difhculties! 
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Lemma 5 Suppose y = — |A;x^. Then dy/dx = —kx, y = —^k~^{dy /dxY . 
Corollary 6 k~^{dy /dx) = —k^x = {—2y)^ x sgn(— x). 

Now take x = 6 — On, y = ln{6) (so dy/dx = Un{0)), and approximate ln{0) by 
a quadratic y = —^knx"^, with kn ~ some 9 in the region where this 

approximation is adequate. In particular we can take kn to be any of: 

(i) . Jn := -m 

(ii) . j„ := -C(e„) 

(hi). In{e) 

(iv). i„ :=/„(§„) 

where [(iiijj and [(iv)] are motivated by —l'n{0)/ln{0) ~ 1 for large n. Note that /„ 
depends only on the parameter, i„ and ]n only on the data, while J„ depends on 
both. 

We deduce the approximate equivalence of: 

(a) . knhn{e) 

(b) . ki{Qn-e) 

(c) . {-2l„(0)}^ xsgn(e„-^) 

for suitable kn as above, and any of these quantities will be asymptotically 
A/'(0, 1). Asymptotically, inference about G can be based on any of these proper- 
ties. A test oi : Q = 9q based on the approximate standard normal distribution 

of kniQn — ^o) (or the approximate distribution of /Cn(G„ — ^o)^) under Hq is 
called a Wald test: typically, though not invariably, this uses kn = In{do)- The 
likelihood ratio, or Wilks, test of Hq refers — 2/„(^o) to the Xi distribution. 

Example 4 Let X ~ B{n; 6). Then §„ = X/n, Un{9) = {X - ne)/e{l - 9), 
Jn = X/e^ + {n-X)/{l-e)\ In{e) = n/e{l~e)Jn = in = n/e„(l-e„). Under 
Pg, we have an asymptotic standard normal distribution for In{0)~'^Un{0) = 

in{o)Hen - 0) = {X -n e)i^ne{i-e), and for j|(e„ -6) = ii{Qn -e) = 

{X — n9) / y nQn{^ — ©n)- With data X = x, an approximate 95% confidence 
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interval could taken as as 6'„ ± 1.96y^„(1 — 6n)/n. A test of 6 = could 

be based on referring any of (x — uOqY /n6Q(l — Oq), {x — uOqY /n6n{l — On), or 
— 2l„(6'o) = 2[x log(x/n6'o) + {n — x) log{(?i — x)/n{l — Oq)}] to tables of xl- 

With o (observed value) being, in turn, x and n—x (for each of the two categories), 
and e (expected value) correspondingly uOq and n(l — ^o), we can write these test 
statistics as ^(o — e)^/e, X](o — e)^/o, and 2^olog(o/e). The first of these is 
"Pearson's chi-squared statistic", sometimes denoted by X^; the last is sometimes 
denoted by F^. □ 



3.1 Distribution of MLE 

Here we consider in more detail the asymptotic property |(b)| kn (©« — 0) ^ 
J\f{0, 1), for the various choices for kn- We start with |(iii)] 

Theorem 7 Suppose l{x,(f)) is three times differentiahle in (p E &. Fix 9 E &, 
and suppose there exists a random variable Z on X, with finite expectation under 
Pg, such that we have the locally uniform hound: |/'"(X, 0)| < Z for all cf) in 
some neighbourhood N of 9. Then the distribution of In{9)^ (0„ — 9) under Pg 
is asymptotically JV{0, 1). 

Proof. In the following all probability calculations arc under Pq. 
By the mean-value theorem, 

o = c/„{e„) = c/„{0) + {e„-0)c/;{e:) (4) 

for some ©* between 8 and @„. Thus 

/n(9)i(e„ -e) = Sn{9) {-in{9)-^ uU&*n)}-'- (5) 

Now U!^{6) = ^i(^) the sum of n independent and identically distributed compo- 

nents, with Ee{— u^(S)} = i{9). Also, In{S) = ni{9). So, by the weak law of large numbers, 

-7„(0)-ic/;{e)4i. 

Furthermore, with probability 1 ©„ — > 6, so that eventually ©* 6 A^, and then 



\in{er'{K{ei)-K{e)}\ = m-' n 



J2{^"ix^,e*J-l"{x,,e)} 



i = l 

< i{e)-i|e;-e| z„, 

where Z„ := I t>y the weak law of large numbers, while |©* — 9| by 

strong, and hence weak, consistency of @„. We deduce 

-/„{0)-ic/;{©:)4i. (6) 

The result now follows on applying Slutsky's lemma to jSJ, 

using Sn {6») A Ar{0,l) and inj. □ 
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The above result is often expressed as: 6„ -4 A/'{6', ^} (although the right- 

hand side itself varies with n). This can be read as: "0„ is asymptotically 
normal and asymptotically unbiased, with asymptotic variance ln{0)~^" . Note 
that this asymptotic variance is the Cramer-Rao lower bound on the variance of 
any unbiased estimator. Thus 0„ is "asymptotically efficient" . 

For ^)\ use 0„ — 7- 6* to see that (assuming the function i{6) is continuous) 
I{Qn)/I{0) = i{Qn)/i{0) A 1. The result now follows by Slutsky's lemma. 
Similarly the result for [(i)] follows on using the weak law of large numbers to 
show —l'^{9)/I{9) —7- 1. The argument for |(ii)| is similar. 

There is both theoretical and experimental evidence that the data-dependent 

scalings (especially |(ii)| ) yield closer approximations to the normal distribution 

than use of |(iii)| Moreover, as in Example |31 using the fully data-based form 

|(ii)[ we can readily form an approximate 95% (say) confidence interval for 6 

— _i , ^ _i 

as dn ± 1.96jn^ (alternatively, using |(iv)[ we obtain On ± 1.96 in ^ — this form is 

more commonly seen, but may be less accurate). In contrast, direct use of |(iii)| 

or [(i)] would involve solving a typically non-linear equation for 9 — and would 

generally give a poorer approximation to boot. (Note however that none of 

these approximate intervals transforms properly under a non-linear change of 

parameter) . 



4 Hypothesis tests 

Suppose 6 is an interval of the real line, and we wish to test the null hypothesis 

i^o : e = ^0 

for some specified Oq G 6, against the alternative Hi that leaves totally un- 
specified. 



Definition 2 The (maximum) likelihood ratio (MLR) statistic for testing Hq 
against Hi i^ 

A = A(0o) ■.= LM. (7) 

The Wilks statistic isW = W^Oq) := -21ogA(^o) = -2l{9o). □ 



The following important result now follows from the asymptotic equivalence, 
demonstrated informally in §13 of |(c)| and Sn{6)', a rigorous argument can be 
developed along the same lines as Theorem [3 



^You may sometimes see the definition given with numerator and denominator interchanged. 
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Theorem 8 (Wilks) Under Pe^^, the asymptotic distribution ofW{9o) is Xi- 

Notes: Suppose independent and identically distributed sampling from a continuous model. 

(i) . Use of the approximate distribution function for W is typically accurate with a relative 

error of order 0{n~^). 

(ii) . The signed root likelihood ratio statistic \{c)\ of S isl is r{9o) := sgn{9 — 9q) \/W{9(j); 

this will be asymptotically standard normal under Pg^ . However this approximation is 

typically accurate only to order 0(n~2). 

(iii) . In fact ^gQ{W (Oq)} has the form l+b(9Q) /n+0{n^'^) , where we can in principle calculate 

the function b{9). If we replace W by W' := W/{l + b{9)/n\ — a move known as Bartlett 
adjustment — we find that not only the mean but the whole distribution is now with a 
relative error of only 0{n~^). For small samples this can make a dramatic improvement. 

It follows from Theorem |H1 that we can calculate an approximate significance 
level, for testing Hq : Q = 9q against Hi, as 

5L(^o):=Prob(x?>-21ogA), (8) 

where A is the value of A for the observed data. One would reject Hq at level 
5% (say) if SL{9o) < 0.05, or equivalent^ -logA > 1.92, where 1.92 = i x 3.84, 
3.84 = (1.96)^ being the upper 5% point of the xl distribution. This will be 
asymptotically equivalent to the score test (with which it shares the property of 
being invariant under invertible recodings of the observable X 1^1 or the parameter 
0), and to Wald tests based on kn{Q — doY (which are however not invariant). 
(Note that the score test is generally simpler to implement, since it only involves 
quantities defined locally to the null value Oq: in particular, it does not require 
calculation of the MLE 6). 

Similarly we can construct a 95% (say) confidence interval for O as 

/ := {Bee-. SL{e) > .05} 

= {eeB: ln{e) > -1.92} (9) 

(Note that this inference method respects the Likelihood Principle.) It follows 
that, under any Pg G V, in large samples there will be an approximately 95% 
chanc^ that this construction will produce an interval that contains 6. 

4.1 Multiparameter extensions 

The above results (and their proof techniques) extend in a pretty straightforward 
manner to a p-dimensional parameter space. We then have asymptotic equiv- 
alence of the random vectors A~^Un{0) and y4^(9„ — 9), where An is a matrix 

^or, indeed, by replacement of X by a sufficient reduction. 

^Ideally one would like this to hold, simultaneously, for all 6 E B. This would require 
conditions and arguments allowing the extension of the relevant asymptotic sampling results 
(such as Theorem EI) to hold uniformly over (B. 
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square roolHof Kn, i.e. AnA]^ = Kn, the matrix Kn being any of In{d), in, Jn or 
]n (with their obvious multiparameter definitions). Any of these quantities will 
have an asymptotic A/'(0, Ip) distribution under Pq; in particular, asymptotically 
Un{9) ~ 7V(0, In{9)) and (§„ - 9) ^ 7V(0, In{9y^). The asymptotic distribution 
of U{9)^K-^U{9) and of (§„ - 9y^Kn{Qn - 9) will be Xp, as will be that of the 
assymptotically equivalent Wilks statistic W{9) := — 2/„(6'). 

4.2 Composite null hypothesis 

Suppose we wish to test a composite null hypothesis: 

where 111 is a smooth submanifold of the parameter space 6. One approach 
involves generalising Definition |2] as follows: 

Definition 3 The (maximum) likelihood ratio (MLR) statistic for testing Hq : 
6 e M against i^i : 6 G 6 \ IM is 

A:=supL„(^) (10) 

6»6M 

The Wilks statistic is again W := —2 log A. □ 

Theorem 9 (Wilks, general version) Lei dim(6) = p, dim([11) = q, r = p~q. 
Then the asymptotic distribution ofW, given 6 = 9o, is xl for any 6'o G M°. 

Proof. (This proof emphasises algebraic and probabihstic aspects. A fully rigorous account 
would require additional technical conditions and analysis, along the lines of Theorem |7| and its 
proof.) 

In a neighbourhood of Bq, we can regard G as an open subset of Kp, and parametrise M as 
: (3 e B}, where IB is an open subset of R'' and /i : IB ^ G is (1, 1) and diffcrcntiable. Thus 
under M, can take as our parameter B S IB, and then G = h(B). Let B = /3o £ IB correspond 
to the specific null value © = So G i.e. do = h{[5o)- Let U,X be the p X 1 score vector and 
p X p information matrix for 0, evaluated at Bq; and let be the g X 1 score vector and 

qx q information matrix for B, evaluated at /3o. Also let H be the px q matrix with entry 
hij = dhi{f}) / d f} j , again evaluated at Pq. Then 

V = H^U 
J = H^XH. 

Let l'^ denote the log-likclihood of the data under the null value © = Sq, or equivalently B = 
/3o- Under the null we have asymptotic equivalence of —2(^5^ "^upggg l„ [d)} and u'^X~^U; and. 



^We could use e.g. the symmetric or lower triangular square root; but should not let the 
way we form this square root depend on the data. 
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similarly, of —2{l'^ — supggi^ ln(d)} and V J ^V. Hence W, their difference, is asymptotically 
equivalent to 

W* := U'^ {X'^ - H{H^XH)-^H^}U 

= Z^{I - k{k'^k)-^k^}z 
= z^uz 

say, where Z := A~^U, K := H with A a matrix square root of X. We note that If is 
an orthogonal projection matrix (If = El'^ = 11^) of rank r. So we can write If = QQ^ for 
a p X r matrix Q with Q = Ir (take the columns of Q to be an orthonormal basis of the 
range of H). Then W* = y'^Y with Y := Q^Z. But asymptotically Z ~ Af{0,lp), whence 
Y ~ A^(0, Q^IpQ) = JV{0, Ir). The resuh follows. □ 



Note particularly that the asymptotic distribution of W is the same, for any 
value of G satisfying the null model. We can test Hq against Hi by referring the 
observed value w of IV to tables of the Xr distribution. 

Although we have proved Wilks's theorem only for the independent and identi- 
cally distributed case, it holds much more generally. 

Example 5 Generalised linear model Let V = {P^ : G #} be an exponen- 
tial family with its natural parameter. The "saturated" model has observable 
X = (Xi, . . . , Xn) and parameter $ = . . . , $„) G with Xi ~ P$^, inde- 
pendently {i = 1, . . . We consider a GLM M of the form M = 
(0 G 6 C [RP), for specified nxp design matrix An] and a hypothesis (sub-model 
of M), : C = (C being r x p). 

Standard statistical software will output, for any data x and any GLM M, the 
deviance D{M) := 2{sup0g^„ — sup^^j,^ K'P)}- Then the Wilks statistic for 
testing H within M has value w = D{H) — D{M). Assuming An and C are of 
full rank (and An is "well-behaved" as n — ?■ oo), an approximate test of H within 
M is conducted by referring w to tables of Xr- ^ 

4.3 Profile likelihood 

Let : G ^ #, and $ := 0(6). 

Definition 4 The profile likelihood function L$ : # — t- 5? is given by 

L$(0) = sup{L(e) : ^ G e, 0(e) = 0}. (11) 

□ 
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Clearly the MLE := (j){e) of $ is the maxiiniser of with. L^[(f)) — s\rp^^^ L^[(f)) — 
L, and the scaled profile likelihood L$ is obtained by applying flllll directly to L. 

Note that — 2l$(0) is the Wilks statistic for testing $ = 0. In particular, when 
dim# = 1, we typically have, for any G dim{^ : = (p} = p — 1. Then, 
by Theorem ini l<i>(0) > —1.92 with asymptotic probability 95%, under Pq, for 
any 6 with (f){6) = 0; and so the likelihood-based interval {0 : l<i>(0) > —1.92} is 
asymptotically a 95% confidence interval for $. 



Chapter V 



Nonparametric methods 



1 Nonparametric estimation of a distribution 

on R 

In nonparametric analysis the family V of possible distributions for the observable 
X is no longer describable by a discrete or finite-dimensional parameter. Here 
we will restrict ourselves to the case X = X" := {Xi, X2, . . . , Xn), where each Xj 
is real-valued, and V is the family of all distributions under which the (Xj) are 
independent and identically distributed, from some arbitrary distribution P on 
R. We observe X" = := {xi, . . . , and wish to make inferences about P. 

We can describe P in various ways, especially by means of its ( cumulative ) dis- 
tribution function (cdf) F : [R — )• [0, 1], where: 

F{x) = P{-oo,x] 

= Prob(X < x) where X ~ P. 

A cdf F has the following properties: 

(i) . F is non- decreasing 

(ii) . F(-cx)) = 0,F(+oo) = 1 

(iii) . F is right-continuous: F{x) = F{x+) := limy^x F{y). 

Conversely, any function with these properties can serve as a cdf. We then have 
F{x-) := liiRy^x F{y) = P{-oo,x) = Prob(X < x). 
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Definition 1 The empirical distribution based on data is the discrete distri- 
bution 



^ ^ 1 

P„ = P„(x"):=-5^5., (1) 

i=l 

where 6^ attaches probability 1 to the point x. □ 



Thus P"(x") attaches probability mass 1/n to each observed data-point (with 
suitable accounting for any repetitions). Before we have conducted the experi- 
ment, we have random X"-, and correspondingly random Pn = ^ X]r=i ^^i- 

The cdf of P„ is the (random) empirical distribution function Fn{-), given by 

Fn{x) = <n:X,<x}. (2) 

n 

Intuitively, for large n P" (resp., P„) should approximate P (resp., F). Here we 
shall try and refine this intuition, and assess the uncertainty involved. 

The following consistency result is an immediate consequence of the strong law 
of large numbers: 

Lemma 1 For any measurable set A C with probability 1, 

Pn{A) P{A) asn^oo. (3) 

Corollary 2 For any x G IR, with probability 1 

Fn{x) ^ F{x) (4) 
Fnix-) ^ F{x-). (5) 



2 Uniformly consistent estimation of the distri- 
bution function 

The limiting behaviour in does not hold uniformly across all measurable A. 
Indeed we typically have, with probability 1, 

lim inf sup |P„(A) - P{A)\ > 0. 

Thus suppose that the discrete part (if any) of P has total mass d < 1 . Taking An 
to be the (discrete) support of P„, for all n we will have P„(A„) = 1, P{A) < d, 
so that sup^ \Pn{A) - P{A)\ >l-d>0. 
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If however we restrict A to some subcoUection A of sets, we might be able to 
obtain uniformity. The Glivenko-Cantelli theorem shows this for the case that A 
consists of all semi-infinite intervals. 



Theorems (Glivenko-Cantelli) Let Xi,X2,... be independent and identi- 
cally distributed real random variables, with distribution function F. Let F„ be 
the empirical distribution function of {Xi, . . . Then with probability 1, 

sup|F„(t) -F(t)| ^0. 
teiR 



Proof. Let e > 0. There exists a finite sequence —oo = t^ < ti < . . . < tk = oo 
such that F{ti-) - F{ti_i) = Prob(ti_i < X < ti) < e for all i. Then for 

^i-l < t < ti, 

Fnit)-F{t) < Fn{ti-) - F{ti^i) 

< Fn{ti~) ~ F{ti-) + e 

< max I Fn{ti-) - F{ti-) } + e. 

Similarly 

F{t) - h{t) < max - - e. 

Hence for alH G IR, 

- F{t)\ < maxmax - F(t,)|, - + e 

(this holding trivially if t is one of the {ti}). But by Corollary for each i, 
Prob(|F„(ti) - F{ti)\ ^ 0) = 1 and Prob(|F„(ti-) - F(t-)| ^ 0) = 1. Since 
the conjunction of a finite number of almost sure events is almost sure, with 
probability 1 

maxmax||F„(t,) -F(t-)|} ^0, 

whence limsup„_^oo sup^gi^ \Fn{t) — F{t)\ < e with probability 1. Since e > was 
arbitrary, the result follows. □ 



Corollary 4 With probability 1, sup^gR \Fn(t-) - F{t-)\ -)■ 0. 
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3 Inference for the distribution function 

Suppose X has continuous real cdf F, and consider the variable U := F{X). 
Lemma 5 f/ ~ U{0,1). 

Proof. Let u G (0,1). Since F is continuous and non-decreasing, A := 
{y : F{y) > -u} is a closed semi-infinite interval [x, oo) with F{x) = 'm, and 
F{y) < u ^ y < X. So Prob(t/ < u) = Prob{F(X) < u} = Prob(X < x) = 
Fix-) = F{x) = u. □ 



3.1 Hypothesis test 

In particular, suppose we wish to test the hypothesis Hq that the (Xj) are in- 
dependent and identically distributed with specified continuous cdf F. This is 
equivalent to the hypothesis Hq that, with Ui := F{Xi), the (Ui) are independent 
and identically distributed U{0,1), i.e. have cdf F*{u) = u {0 < u < 1). This 
reduces the problem of testing a quite general continuous F to the special case 
F = F*. 



4 Kolmogorov-Smirnov test 

Any reasonable test of uniformity of the Ui will be based on their empirical 
distribution function F*, which from Theorem IHl we know will, almost surely, 
converge uniformly to the uniform cdf F* when the (f/j) are indeed uniformly 
distributed. 

Definition 2 The Kolmogorov-Smirnov statistic for testing uniformity is 

Kn:=sup{\dn{u)\:0<u<l}, (6) 
where dn{u) := F*{u) — u. □ 

It is readily seen that (for continuous F) dn{u) = Fn{x) — F{x) with u = F{x), 
and so 

K^ = snp{\F^ix)-F{x)\. (7) 
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This is form of the Kolmogorov-Smirnov statistic for testing the null hypothesis 
Hq that the common distribution function of the [Xi) is F. The distribution of 
Kn when Hq holds will not depend on though it will depend on n. 

Tables of the distribution of Kn are available: see e.g. Birnbaum, Z. W., (1952), 
Journal of the American Statistical Association, Vol. 47, pp. 425-441. For n > 35 
the asymptotic distribution (see ^ 14. II below) can be used. 

It simplifies computation of Kn to note that the supremum in ((7j) must be attained 
at a data-point. To perform a size-a test of Hq, we look up da{n), the upper 1 — a 
quantile of the distribution of Kn, and reject if the observed value of Kn exceeds 
da{n). We can also form a level-7 "confidence band" J^^, consisting of all those 
continuous distribution functions F which would not be rejected by the test of 
size a = 1 — 7. That is, F & J^^ if, for all x, F{x) lies within the constant-width 
band Fn{x) ± di^^{n) about the empirical distribution functionH 

4.1 Asymptotics 

For any fixed v G (0,1), nF*{v) = #{z < n : Ui < v} has the binomial dis- 
tribution B{n;v) with mean nv and variance nv{l — v). By the central limit 

theorem, n^(i„(t>) A/'{0,f(l — v)}. Similarly, for < f < m < 1, the dis- 
tribution of n{F*{v),F*{u) — F*{v), 1 — F*{u)} is trinomial with probabilities 
{v,u — v,l — u), whence we find cost {n^^ dn{v) , n"^ dn{u)} = v{l — u). 

It can be shown that, considered as random functions from (0, 1) to IR, the 
distribution of n^dn{-) converges (in a suitable sense) to that of the 0-mean 
Gaussian process B{-) determined by the above asymptotic covariance structure: 
cov {B (v), B{u)} = v{l —u) {v < u) (this is known as the Brownian bridge); and 
that the limiting distribution of the functional n^Kn = sup{n^\dn{-)\ : < u < 
1} of nidn{-) is given by the distribution of the same functional of the limiting 
process B{-), viz. K := sup{|-B(m)| : < m < 1}. Derivation of this distribution 
(the Kolmogorov distribution) is beyond our scope, so we merely assert: for a; > 0, 

00 — 00 

Prob(i^ < x) = 1 - 2 ^(-l)-ie-2^'^' ^ V27r ^ ^_(2._i)^.V(8x^). (g) 

i=i ^ i=i 

This function is tabulated in G. R. Shorack and J. A. Wellner, Empirical Pro- 
cesses: With Applications to Statistics (Wiley, 1986), p. 143. 

^ Caution: This works only when F is fully specified. It would fail if for example we wanted 
to test that the distribution is normal but with unspecified mean and variance, and used for F 
the normal distribution function with sample-based estimates of its parameters. 

^If di-^in) is small enough this band will be empty! 
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For large n the size-a Kolmogorov-Smirnov test will reject Hq for Kn > n~^Ka, 
where is chosen so that Prob(A' > Ka) when K has distribution (jH)). It 
follows from the Glivenko-Cantelli theorem that the Kolmogorov-Smirnov test is 
consistent, i.e. if F is not the true distribution function then the probability of 
rejection will tend to 1 as n — oo. 



4.2 Two-sample test 

Suppose now that (Xi, . . . are independent and identically distributed with continuous 

distribution function Fx, independently of (Yi, . . . , Ym), which are independent and identically 
distributed with continuous distribution function Fy- We wish to test the null hypothesis 
^^0 ■ P'x = Py, the common value, F say, being unspecified. 

Let Fx, Fy be the empirical distribution functionss of the X's and K's, respectively. The 
two-sample Kolmogorov-Smirnov statistic is 

D„,m.= snp\Fx{x)-FY(x)\. (9) 

xeR 

If we replace Xi by F{Xi) and Yj by the problem is transformed to one in which 

(under Hq) F is replaced by the uniform [7(0,1) cdf, while the value of D„^m is unaltered. 
Consequently the null distribution of D„^rn is the same, no matter what may be the common 
distribution function F. 

Take n = Np, m = Nq (p & (0, 1), g = 1 - p) and let TV -> oo. Then Nipi {Fx{-) - 
Fx(-)} Bx{-), and Ni qi {Fx{-) — Fy{-)} By{-), where Bx, By are independent Brow- 
nian bridge processes. So, under Hq, N2{Fx{-) — Fy{-)} P^^ Bx{-) — 9 2_By(.). Con- 
sideration of the covariance function of this Gaussian limiting process show it to be of the 
form {p~^ + q~^)^ B* {■), where B* is a Brownian bridge. We deduce that the asymptotic null 
distribution of D* „j := {nm/{n + r?i)}2Z)n,m is the Kolmogorov distribution jgj. Hence we 
can conduct an asymptotically size-a test of the (composite, nonparametric) null hypothesis 
Ho : Fx = Fy by rejecting Hq when D* „j > Ka. Again, this test is consistent, i.e. rejects 
with probability tending to 1 if Fx ^ Fy . 



3 Note that we can not actually compute these values in absence of knowledge of F. 



Chapter VI 
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1 Pivotal inference 

Definition 1 A pivotal quantity, or pivot, in an experiment £ = (X,X, 6, Q,V) 
is a function of X and B that is distributed independently of 0. □ 

Another common synonym for pivot is root. 

Example 1 Suppose Xi ~ N'{M,v), independent and identically distributed, 
i = 1, . . . ,n, where the variance v is known. Let X be the sample mean. Then 
the quantity K := n^(X — M) is a pivo10 having known distribution N'(0,v), 
given any value for M. □ 

Example 2 Now suppose X^ ~ M(M, V), independent and identically distributed, 
i = 1, . . . ,n, with both parameters unknown. Let S*^ be the usual unbiased esti- 
mator of V. Then a pivot is T := {X — M)/{S/y/n), which has a Student t„_i 
distribution (and is asymptotically standard normal). □ 

Pivots are useful for constructing confidence intervals and hypothesis tests. Thus 
in the known variance case, whatever be the value n of M, the probability is 95% 
that K lies in the interval ±1.96 v^, and consequently this is also the probability 
that the interval X ± 1.96{v/n)^ contains fi. With both parameters unknown, 
the familiar t-test of Hq : M = fio is based on the fact that, if Hq is true, the pivot 

""^The scaling by , while not essential for immediate purposes, yields a distribution that 
is insensitive to sample size. For asymptotic analysis it is helpful to consider pivots with non- 
trivial limiting distributions. 
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T is equal to {X — fio) / [S / ^/n) , which can be calculated from the data; and if 
that value does not appears consistent with the t„_i-distribution the hypothesis 
is put in doubt. Similarly, we can base confidence intervals for M on the pivot T. 



2 The parametric bootstrap 

For the normal case, the actual sampling distributions of the above pivots are 
available from general theory, and readily manipulated. Let us suppose for a 
moment that one or other of these conditions were false. What could we do? For 
simplicity we just consider the normal location model of Example ^ 

Suppose we have data (xi, . . . , a:„). We start by estimating the underlying data 
distribution: an obvious estimate is ^/{x, v). Although this of course differs from 
the true distribution Af[M,v), it belongs to the same normal location family. 
Consequently, both these distributions will have the identical distribution, D 
say, for the pivot K. So consider a new sample, {X^, . . . , X*), the same size n as 
the original data-set, but now generated from this estimated distribution. Then 
K* := n^{X — x) has the same distribution as K = n^{X — M). Even if we 
couldn't actually compute this distribution, we can readily approximate it by 
simulation. From the estimated distribution A/'(x, v) we generate many random 
samples ( "resamples" ) of size n, and for each compute the associated value k* of 
the pivot K*: thus for a specific resample {x^, . . . , ), we obtain k* = [x* — x). 
If we do this for M independent resamples, we will obtain a random sample of size 
M from the distribution D of K* — which is the same as that of K. Let D* denote 
the empirical distribution of these values, i.e. the discrete distribution putting 
probability 1/n on each value appearing (with obvious adjustment in the case of 
repeated values). Then D* is an approximation to the desired pivotal distribution 
D. Using D*, we can now conduct approximate pivotal inference for M, based 
on the original sample mean x. If M is sufficiently large the approximation will 
be good, with high probability. 

2.1 Other parameters 

Staying with the normal location model, suppose (purely for illustration) we are 
interested in the upper quartile Qu of the distribution. We could just use the 
known normal-theory relationship Qu = M -|- 0.6745 ?;2 to make an appropriate 
adjustment to our inference for M. Alternatively (if less efficiently) we might 
base inference directly on the pivot Ku ■= n2(^Xu — Qu), where X„ is the upper 
quartile of the sample (Xi, . . . , Xn). Again, for a given dataset (xi, . . . , Xn) we 
can use resampling from the estimated distribution A/'(x, v) to approximate the 
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distribution of Ku- In particular, we can estimate b := E{Ku), and so obtain the 
approximately unbiased estimate — n~^b of Qu- 

3 Nonparametric bootstrap 

A still further level of approximation ignores the fact that the underlying distri- 
bution is known to be normal. For our estimated distribution we now use, not 
M(x, v), but a non-parametric estimate based on the original sample (xi, . . . , Xn)- 
Various forms are possible, but the most usual is also the simplest: just the em- 
pirical distribution of the data. The corresponding estimate of Qu is now the 
sample upper quartile, Xu- 

A resample is now simply a random sample with replacement from the values in 
the original data-set (in particular, this will usually contain some repeats). Let 
{xl, . . . ,x^) be a generic resample. Then the associated estimate of the pivot is 
fc* := — Xu)- If n is large then the empirical distribution of the data will 

be close to the parametric estimate A^(x, w), and then an empirical distribution 
for ku, computed from a large number M of resamples, will approximate the true 
distribution of Ku = n'i^Xu — Qu), and can be used, together with the estimate 
Xu from the original data, as a basis for approximate pivotal inference about Qu- 

3.1 Unknown distribution 

The nonparametric bootstrap made no use of the assumption that the underlying 
distribution was normal with variance v. Consequently it could be applied, un- 
modified, to any location family, to make inference about (say) its upper quartile 
Qu- So long as n is large enough, the shape of the estimated distribution, used 
for resampling, should be close to that generating the original data — i.e. they 
should, to a good approximation, differ only in their location. Then (so long 
as M is large enough) the bootstrap distribution of the pivot should be close to 
its true distribution, and hence usable to make inference about Qu- And this 
holds for any underlying distribution — which can thus be completely unspecified. 
In particular, even the assumption of an underlying location model is no longer 
needed. 

3.2 Interval estimation 

There are various ways of using the bootstrap to form an approximate 95% (say) 
confidence interval for a parameter such as (say) Qu- These include: 
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Crude bootstrap interval Approximate tlie distribution of Ku by a normal 
distribution with mean b and and variance o"^ estimated from the bootstrap 
resamples, and find lower and upper confidence limits by equating n^{xu — 
Qu) to (/+,/_) := 6± 1.96a. 

Basic bootstrap interval As above, but take /_ [resp. /+] to be the lower [resp., 
upper] 2.5%-point of the nonparametric bootstrap estimate of the distribu- 
tion of Ku- 

Percentile interval Take the lower [resp., upper] confidence limit for Qu to 
be the lower [resp., upper] 2.5%-point of the bootstrap distribution of X*. 
(This works if the there is a monotonic function g such that the pivot 
g{Xu) — g{Qu) has a symmetric distribution about 0, with a shape well 
approximated by that of its bootstrap distribution.) 

3.3 Other pivots 

The same essential logic can be applied using from other pivots. For example, if 
we had started with a location-scale model, we might have used a "Student-f - 
type pivot, such as (^Qu — Xu)/ S, where S is some location-invariant estimator 
of scale, such as the interquartile range of the data {Xi, . . . ,X„). We can again 
approximate its distribution, again by resampling from the empirical distribution 
of the data. And this might be more "robust" , and more quickly convergent, than 
just using n2(Qu — Xu) as our pivot, since all we now require is that the empirical 
distribution of the data and the true data-generating distribution should agree 
(approximately) up to a location-scale transformation [i.e., we need no longer 
require them to have approximately the same scale). Such "studentised pivots" 
are generally recommended over simple unsealed pivots. 

4 Further considerations 

The basic idea of bootstrap inference, and its basic implementation (though 
highly computer-intensive), are extremely simple and appealling — almost too 
good to be true. But it really does work. 

Of course there are many loose ends to tie up. Theoretically, we would want to 
be able to show that, under suitable conditions, the convergence of the sample 
empirical distribution to that of the data-generating distribution is fast enough 
that, for large n, substituting the former for the latter is asymptotically negli- 
gible, when we focus attention on their associated distributions for the chosen 
pivot. We would also like this convergence to hold uniformly, over a suitable 
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class of distributions. Pragmatically, we would like guidance on how big should 
n and M be for bootstrap inference to be reliable. There is also the question 
of what is the best pivot to use, among many variant possibilities, both from 
the point of view of getting good sampling distribution approximations out of 
the bootstrap, and from more principled points of view, such as using the data 
most appropriately and efficiently {e.g. through a sufficient statistic, and/or with 
appropriate conditioning). 

Proper attention to such points can involve very sophisticated analysis, seemingly 
far from the "back-of-the-envelope" nature of the basic idea, and has turned 
bootstrap inference into one of the most active areas of current statistical research. 
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Chapter VII 
Bayesian Inference 



Bayesian inference is an over-arching and self-consistent approach to the enter- 
prise of Statistics. For any problem, we use Probability to model both "physi- 
cal" uncertainty about potential observables, given the relevant parameters, and 
"epistemological" uncertaintjjl] about those unknown parameters. We then have 
a full probabilistic joint distribution for all unknowns, and statistical inference 
reduces to calculating the conditional distribution for unknown quantities of in- 
terest (which might include future observables, as well as unknown parameters), 
given the observed data. Anything else — e.g., summarisation and display of 
this distribution, calculation of its features, or using it to address problems of 
decision-making under uncertainty — is just an optional add-on. So once we have 
passed the (important!) stage of specifying our initial ingredients, Bayesian in- 
ference just reduces to the mathematics of probability theory. That is, Bayesian 
inference replaces the inverse relationship 

STATISTICS = (PROBABILITY)^^ 

by the direct relationship 

STATISTICS = PROBABILITY. 

If only every one agreed that the Bayesian approach was the right one, then 
Statistics would be set on a firm logical and mathematical basis! 

This logical coherence, avoiding the need for a new trick for every new problem, 
means — in principle at least — that Bayesian methods can be applied to an 
enormous range of extremely complicated and realistic models. In practice, how- 
ever, this raises new challenges: both of specifying appropriate prior distributions 

^ There is a large body of theory, based for example on principles of rational economic 
behaviour, justifying the use of Probability Theory for this purpose. 
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over high-dimensional parameter spaces, and of computing, and computing with, 
the resulting posterior distributions. Much of modern Bayesian research is aimed 
at addressing these twin challenges. 

1 Prior and posterior 

A truly fundamentalist!! Bayesian approach does not mess with parametric models 
at all, but directly models joint epistemological uncertainty about a collection 
of observables including both those to be observed and those to be predicted; 
it focuses on directly finding the conditional distribution of the latter, given 
observations on the former. However for current purposes we restrict attention 
to Bayesian analysis of a parametric statistical experiment S = (X,X, 6, Q,V). 

The statistical model V provides the conditional density p{x \ 9) (with respect to 
some fixed dominating measure /i on X) of the observable X, given G. It is up 
to the statistician/decision maker to specify, in addition, a marginal distribution 
n (interpreted as representing epistemological — typically "subjective" or "per- 
sonal" — uncertainty) for the parameter 6, which now has the mathematical 
status of a random variable. We denote the density of 11 (with respect to a suit- 
able measure v on the parameter-space 6) by vr(-). The distribution 11 describes 
perceived uncertainty about 6 before taking any account of the observed value 
of X, and should ideally be specified at this stage. It is thus termed the prior 
distribution of 6. 

There is no necessary connexion whatsoever between the two building blocks: the 
conditional distributions for X given G, as specified by the statistical model, and 
the marginal distribution for 6, as described by the prior. But once they have 
been separately specified, their conjunction fully determines the joint distribution 
for (X, 6). We can manipulate this in any way we please, using the standard 
probability calculus, to address any queries we may have. 

The joint density of (X, 9) (with respect to /i x z/ over X x 6) is 



Although this is a distribution for the observable X, it is not purely "objective", 
but rather an odd amalgam of the "objective" model density p{x \ 9) and the 



p{x,9) =p{x\9)ti{9). 



(1) 



The marginal density for X alone is thus 




(2) 



^See e.g. Bruno dc Finetti's highly original two-volume work Theory of Probability. 



1. PRIOR AND POSTERIOR 



69 



"subjective" prior density 7r{6). It represents your uncertainty about how X will 
turn out, lacking knowledge of the value of 0. It is termed the (prior) predictive 
density of X. 

Using and the conditional density 7r(^ | a;) for O, given X = x, is 

7r(6' I x) = p{x,9)/p{x) 

oc p{x\e)7r{e) (3) 

where the proportionality sign indicates the omission of a factor [viz., l/p{x)) 
that does not depend on the argument 6: after observing data X = x, this 
factor is simply a constant. It is enough to know 7r(6' | x) up to such an unspec- 
ified factor, which can always be recovered from the normalisation condition: 

j^7r{e\x)duie) = i. 

The distribution U^, whose density is 7r(6'|x), represents the relevant episte- 
mological uncertainty about G after having observed X = x. Obtaining and 
describing this posterior distribution is typically the main, and often the sole, 
aim of a Bayesian analysis. We can apply any probabilistic manipulations to this 
end, but normally it will be accomplished by the application of formula dHj). This 
is Bayes's theorem. Since, for the given data X = x, p{x \ 9) is proportional to 
the likelihood function L(6'), we can also write this as 

Tx{e\x) (XTx{e)L{e). (4) 

In words: 

POSTERIOR oc PRIOR x LIKELIHOOD. 

In particular, the only feature of the statistical model needed to perform a prior- 
to-posterior analysis is the likelihood function (up to proportionality) for the 
specific data at hand. Thus so long as the identical prior distribution is used 
for a quantity G in any experiment in which it appears as a parameter (a not 
unreasonable requirement), Bayesian inference will respect LP. 

1.1 Sufficiency 

Suppose that T = t{X) is a sufficient statistic for G, based on data X. Since 
LP =^ SSP ^ WSP, it follows that the posterior of G depends on the data only 
through the value of T, and can be computed starting from the marginal exper- 
iment £ti which observes only T . This property can streamline computations, 
since instead of working with the full data we can use the model for, and observed 
value of, a sufficient statistic. In particular, and most usefully, this holds when 
T is minimal sufficient. 
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Another way of understanding this property is as follows. Sufficieney of T can be written as the 
conditional independence property: 

XALe\T, 

expressing the fact that the conditional distribution of X, given both © and T, can be chosen 
to depend on T alone. This interpretation of conditional independence is meaningful even if © 
is not random. However, as soon we take a Bayesian stance, so that © is just another random 
variable, we can use the symmetry of probabilistic independence to deduce: 

© _U_ X I T. (5) 

And we can now interpret JSJ as asserting that the conditional (= posterior) distribution of ©, 
given X and (redundantly) T, in fact depends on T alone. 

1.2 Comparison of hypotheses 

A special case of prior-to-posterior analysis arises when 6 is a discrete set, X say, 
of hypotheses about the distribution of X. Given : Q = i, X has density 
Pi{x), and n{i) is now the prior probability Prob(ifj) of hypothesis Hi. From (H)), 
the posterior probabihties {Prob(ifi | x)} then satisfy 

Prob(iJi I x) _ Prob(iJi) pi{x) 

PToh{Hj\x) ~ PToh{Hj) ^ pj{x)' ^ ^ 

The term Prob(ifi) /Proh{Hj) is the prior odds on H^ as against Hj, and similarly 
Prob(ifi I x)/PToh{Hj \ x) is the posterior odds. The term Pi{x)/pj{x) = Li/Lj is 
the corresponding likelihood ratio. So 

POSTERIOR ODDS = PRIOR ODDS x LIKELIHOOD RATIO 

Knowing the (prior or posterior) odds between all pairs of hypotheses, we know 
the probabilities up to a scale factor, and hence, by normalisation, we can recover 
them completely. This is particularly useful in the case of just two hypotheses Hq 
and Hi. In this case the odds u in favour of Hi as against Hq is just a recoding 
of the probability vr = Prob(ifi) (and hence of the distribution over {Ho,Hi)): 
uj = IT / {1 — n) , IT = 00/ {I + 00). The likelihood ratio term is pl(a;)/po(a^)• 
l . 3 S ummarisat ion 

Once the posterior distribution has been obtained it can be summarised in a 
variety of ways, according to our further purposes and interests. For example, 
we could calculate the posterior mean, or posterior mode, of 6, which can be 
regarded as point estimates. For any chosen form of summary, the procedure of 
always calculating that summary, for any data, yields an estimator; this can be of 
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interest even to a confirmed frequentist — who would, however, judge it in terms 
of his own frequentist criteria]! which are of no interest to the pure Bayesian. 

We could similarly summarise remaining uncertainty about 0, in the light of the 
data, by means of, say, the variance, or the entropy, of the posterior distribution. 
Another useful summary is a (posterior) credible region: a subset of 6 having 
some given posterior probability {e.g. 50% or 95%). There are of course many 
such regions; all are equally valid, but some, such as 1-sided, or equal-tailed 
central regions, may be more useful. A region of the form {6 & Q : 7r{6 \x) > c}, 
bounded by a contour of the posterior density, is a highest posterior density 
(HPD) region; the constant c being chosen to yield the desired probability. Such 
a region has minimum measure for its chosen posterior probability. Note that a 
HPD region is not invariant under transformation of the parameter. An invariant 
"snug" region can be constructed using, instead, the contours of the likelihood 
function: {6* G 6 : L{9) > c}. Such a region has minimum prior probability for 
the given posterior probability. 

For a real parameter, a Bayesian analogue of a classical 1-sided significance level, 
for testing the hypothesis 6 = 6^0, is the posterior probability Ilx{Q > Oo)- But 
there is not really any useful analogue of a two-sided significance level. 

Bayes factor 

Suppose we seriously entertain the possibility that a hypothesis Hq : Q = 6o 
could be exactly true. Then we should use a prior distribution that gives this 
a positive probability, say Prob(ifo) = t^o', and our inference about whether or 
not Hq holds, on the basis of data X = x, would be captured by the posterior 
probability Prob(ifo I x). 

Under the alternative hypothesis Hi : Q ^ 6q (which has prior probability tti = 
1 — ttq), the value of 6 remains unspecified. Consequently the Bayesian must 
assign a conditional distribution to 0, given Hi. Let this have density 7t{6\ Hi). 

By Bayes's theorem. 



Moreover, p{x\ Hq) 



Prob(iJi I x) 7Ti p{x I Hi) 
Prob(ifo I x) Hq p{x \ Hq) 

p{x I 6^0), while 



(7) 




the predictive density of X, given Hi. 



■^Such Bayesian-derived estimators do typically behave very well in frequentist terms 
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Note how the prior probabihties (ttojTTi) only enter into the first factor in (0), 
while the prior distributions of 6 conditional on the hypotheses only enter into 
the second factor. This second factor is termed the Bayes factor (in favour of Hi 
as against Hq). It is essentially a likelihood ratio (in favour of Hi as against Hq, 
but now the relevant "likelihood" of Hi is not purely objective since it involves 
the conditional prior density 7i{6 \ Hi). 

The posterior probability of Hq and the Bayes factor will depend on the data 
through the value of a sufficient statistic. 



1.4 Prediction 

Bayesians tend to be more interested in predicting as yet unobserved observables, 
rather than in inference about forever unobservable parameters. 

Suppose X represents the data in an experiment, and Y some future observable 
to be predicted. We suppose given a joint statistical model for (X, F), given 9, 
with densities p{x, y\0)] and a prior density 7i{6) for 0. The rules of probability 
give: 

p{y\x) = J p{y\x,9)7r{e\x)du{9), (8) 

expressing the (posterior) predictive density of Y, given the data X — x^ as a 
mixture of the model-based conditional densities, where the mixing distribution 
is the posterior distribution of G. 

Often, X and Y will be independent, given 0. Then p{y \x,9) = p{y \ 9), and we 
get 

p{y\x) = J p{y\9)n{9\x)du{9), (9) 
a posterior mixture over the unconditional statistical model for Y. 



2 Asymptotic posterior distribution 



We deal with this very informally. 

For large n, in the vanishingly small interval around the MLE 6'.„ where the log- 
likelihood ln{9) is significant, it will be approximately quadratic, with slope and 
curvature j„ at 6'„. Hence 

L„(^) ^exp-^j„(^-^„)'- (10) 
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So long as the prior density n^O) is continuous and positive, it will be approx- 
imately constant in this region, so that the posterior density will be essentially 
proportional to the likelihood fllO|) in the region where it is significant, and neg- 
ligible elsewhere. We deduce that asymptotically 

e^mOnX')- (11) 

Importantly, this applies for any (continuous and positive) prior density: given 
enough data, a priori divergent opinions will be brought into posterior agreement. 

1 ^ 

We note the similarity of the conclusion fjll|) . expressed as jn (ft„-e) ~Ar(o,i),to 

1 ^ 

the asymptotic sampling property jn(0n — 6*) ~ A/'(0, 1). Thus with enough data 
both approaches will give numerically similar inferences (though with different 
interpretations). 

3 Conjugate families 

Although there is no logical relationship between the statistical model and the 
prior distribution, certain forms of prior can be more tractable than others. Even 
though these usually will not be good approximations to real prior uncertainty, 
they can be used as building blocks in more complex and realistic specifications, 
thus simplifying their analysis. Such "artificial" priors are therefore worth atten- 
tion. 

3.1 Closure under sampling 

Definition 1 For a given statistical experiment, a family J-" of distributions over 
6 is closed under sampling if, for any prior distribution 11 G J-" for 0, and any 
data X, the posterior distribution of given X = x is also in J-". □ 

This property is particularly valuable when our statistical model is an exponential 
family: 

p{x I 6) = exp{a(x) + b{e) + u{eft{x)}. (12) 

In this case, consider the family J-" of all prior distributions whose density (with 
respect to a given underlying measure u on 6) is of the form: 

7i{9) oc no{9) exp{no b{9) + u{9fto}, (13) 

a (p-|-l) -parameter exponential family with hyperparameters no G R"*" and to G 
— this is termed a conjugate family for the given EF. If the prior is given by 
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f|T^ . and we observe data from the model (fT^ . the the posterior 

will evidently have the same form, with the hyperparameters Uq and to replaced, 
respectively, by rii = uq + n, and ti = to + t (t = J2i=i'^i^i))- Consequently a 
conjugate family is closed under sampling. Since it is itself a (p + l)-parameter 
family, it will usually be easy to describe and manipulate its generic member, 
thus solving at one fell swoop all possible inference problems that could arise for 
any data. 

Note that, since ttq (and indeed the underlying measure u) is arbitrary, any prior 
is a member of some conjugate family, so it does not really make sense to talk of 
"a conjugate prior". However, such a description is frequently used for the case 
where 9 is the natural parameter $ (so m(0) =0), i' is Lebesgue and ttq = 1. 



Using a mixture of conjugate priors is more flexible than using just one. 

Suppose the prior density can be expressed in the form 7r((9) = WjiTj{6), for some given 
sot of densities {irj ) and probability weights [wj). We can represent this in terms of an auxiliary 
discrete parameter J, with Prob(J = jr) = Wj and 7r(S \ J = j) = 

In the joint posterior for (J, ©) given X = x, we then have n{8 \ J = j,x) = nj (6 \ x). Also 
Wj := Prob(J = j\x) oc WjPj{x), where pj is the prior predictive density based on prior ttj, 
and the constant of proportionality is recoverable from the condition Y2j '^j = 1- Consequently 
tt{9 I x) = w*TTj{6 I x), again in mixture form but with revised weights. 

When each of the ttj is of conjugate form, it is typically easy to compute both the component 

posterior ■7Tj{9 \ x) and the predictive density Pj{x), hence the posterior weights i'l^j)- 



4 Normal models 

In this Section, all densities are with respect to Lebesgue measure. 
4.1 Normal location model 

Consider the normal location model, parametrised by its mean M with the vari- 
ance V known. Let h := v~^, the sampling precision. Thus 



3.2 Mixture priors 



(14) 



The likelihood, for data x = (xi, . . . , x„), is 



LnijJ.) oc exp —-nh{x — /i)^. 



(15) 
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This is the exponential of a quadratic form in /i, so conjugacy suggests using a 
prior of similar functional form: 

7r(/i) ocexp-^/io(/i-mo)^ (16) 

i.e. 

M~Ar(mo,/io')- (17) 

The posterior density 7r(/i | x) oc 7r(/i) Ln{fi) is again the exponential of a quadratic 
form in /i, so that 

7r(/i I x) oc exp -^hn{fi - m„)^, 

i.e. 

M|x~Ar(m„,/i;^). (18) 
Equating linear and quadratic coefficients yields: 

hn = ho + nh (19) 
homo + nhx 

rrin = —7— — 7—. 20 
ho + nh 

In particular, the family of normal distributions for M is closed under sampling 
from Af{M, h-^). 

Note that the same posterior p8p would be obtained if we had only observed the 
sufficient statistic X, having distribution J\f{M, {nh)~^}: the analysis is essen- 
tially identical, after first replacing by 1 and h by nh. 

Equation (|T9|l exhibits the posterior precision /i„ as the sum of the prior precision 
ho and the total sample precision nh; while fl2()|l gives the posterior mean m„ of 
M as the weighted average of the prior mean mo and the sample estimate x, each 
weighted with its associated precision. 



Summaries 



Point estimate We might take the posterior mean m„ as in ()20|) as a "Bayesian 
point estimate" of M, while the remaining uncertainty about M is captured in 
the posterior precision hn as in fjTIUl . 



_ 1 

Interval estimate A 95% credible interval for M would be m„ ± 1.96/?-™ ^ ■ 
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Hypothesis test If there were some value jUq for M of special interest — corre- 
sponding to some "null hypothesis" Hq : M = fiQ oi no effect — we might assess 
the tenability of the hypothesis Hq in the light of the data by assessing how far 
out in a tail of the posterior distribution is the value fiQ-. this could be quantified 
by the posterior probability Prob(6 | x) > fiQ, with values close to either or 1 
making Ho look untenable. Note however that since our prior probability that 
Hq holds exactly was 0, so will be its posterior probability. 

Predictive distribution 

The prior predictive density of a single observation X is given by 

/oo 
P{X \ fl) 7T{fl) dfx (21) 
-oo 

which is readily computed from fll4|) and p6|) . We obtain 

X~Ar(mo,/io^ + /i"^)- (22) 

Alternatively we can argue directly as follows. Given M, the distribution of Z := X — M is 
Af{0, h^^). Since this does not depend on M, Z is independent of M, with this distribution. 
Also, M ~ Af{mQ, hg^). So X = M + Z, where the summands are independent normal. Their 
sum is thus normal with the means and variances summed, yielding I22i . 

A similar argument shows that the prior predictive distribution of the sample 
mean X of n future observations is 

X Ar{mo,h^^ + {nh)-^). (23) 
The posterior predictive density of a new observation Xn+i, given data X = x, is 



p{xn+i I x) = 




/i) 7r(/i I x) d/i 



where (importantly) no further conditioning on x is necessary in p{xn+i \ yu), since 
Xn^i _LL X I M. This is of the same form as fl21|) except that the prior hyperpa- 
rameters are replaced by the posterior hyperparameters. Thus we have 

X„+i|x~A/'(m„,/i-i + /i;i). (24) 
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Bayes factor 

In the set-up of § 11. H| suppose we want to test Hq : M = against Hi : M ^ 0. 
For simplicty we take the conditional prior distribution, given the alternative 
hypothesis, as M | if i ~ A/'(0, /ig Then Z := {nh)2X is sufficient for M, with 
Z \Hq Af{0, 1); while, using fl^ . Z | iii ~ Af{0, hn/ho). So the Bayes factor in 
favour of Hi is 

^10 '- 



p{z\ 


\Hi) 


p{z\ 


\Ho) 



1 



exp^{nh/hrr)z'^. (25) 



Large sample behaviour 

As n — >■ oo, the sample mean X remains bounded almost surely. We can thus 
argue as follows. 

The exact posterior Minin, h~^) depends on the prior hyperparameters (mo, ho), 
but for large n their influence is swamped by the data and from (I19|l and (j20|) we 
have, approximately, 

nh 
i.e. 

M|x~Ar{x, (n/i)-^}. (26) 
This may be compared with the sampling property: 

In either case we have X — M ~ A/'{0, (ri/i)"^}, but for the sampling distribution 
this holds conditionally on the parameter M, whereas for the Bayesian posterior 
it holds conditionally on the data X. 



rrir, 



If wo compare the posterior means m* and m„ based on two different priors, Af{ing, (hg) ^} 

and A^{r?iQ, (h^)^^}, we find they approach each other at rate n~^, as against the slower rate 
_ 1 

n 2 at which each approaches the true value of M. Using a suitable measure of distance between 

_ 1 

distributions, the full posterior distributions approach each other at rate n 2 — a result that 
continues to hold even for non- normal priors (and extends to other models). So the posterior 
distribution M| x ~ J^{x, {nh)~^} is in "asymptotically objective", in that it is insensitive to 
initial opinions about M. 

As for the posterior predictive distribution I24i . given extensive data this will be approxi- 
mately Af(x,h~^); and the resulting distributions for different priors now approach each other 
at rate (a result that again extends to other models). 
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From 1251 . the Bayes factor Bio behaves asymptotically as (ho/h)'^ n~ 2 exp ^nhx^. Note 
that the dependence on the prior precision ho docs not disappear in the limit. For fixed non-zero 
X, Bio — > 00, corresponding to increasingly strong evidence against Hq. 

It is interesting to contrast the Bayes factor approach to testing Ho with the classical 

approach. The latter would reject Ho when \Z\ = (nh) 2 |X| exceeds some critical point k [e.g., 

1.96), irrespective of the value of n. The approximate Bayes facorc at this critical point is 

(ho/h) 2 n~ 2 exp -^k^, which tends to as n — > 00, thus eventually discrediting Hi in favour of 

the null hypothesis Ho- So the same data which would lead to rejection of Ho from a classical 

perspective count strongly in its favour from the Baycsian perspective. 



Improper prior distribution 

With each observation made, the precision parameter hn of M increases. Hence 
we could argue that, to represent "great prior uncertainty", that precision should 
be taken as low as possible. But for a proper prior we require Hq > 0. Formally, we 
consider /iq i (for fixed data, while mo stays fixed, or at least bounded). Then 
hn — )■ nh, m„ — )■ x. Hence even for finite samples we might regard the formal 
posterior fl2fi|l as "objective" , inasmuch as based on minimal prior information. 

We could alternatively try taking the limit as /ig i in the prior. However the 
prior density 

= (^^^ exp -^ho{fi - nioY 

then converges pointwise to 0, which does not define a probability distribution. 
If instead we ignore the normalising constant and work from 

7r(/i) oc exp -^holfi - 

we obtain the (pointwise) "formal limit" 

7r(/i) oc 1. (27) 

Again fl27|l does not describe a genuine probability distribution, since there is no 
way of scaling it to integrate to 1: it is an "improper" prior density. Nevertheless, 
if we insert it into Bayes's theorem, expressed as 7r(yU, | x) oc 7r(/i) I/„(/x) we obtain 
the perfectly well-behaved "objective posterior" (pUj) . 

Fully subjective Bayesians frown upon such "objective" priors and posteriors, 
and indeed the purely formal manipulations involved can not be put on a firm 
logical basis, and can generate paradox and inconsistency. Nevertheless they 
have a strong intuitive appeal. They are particularly prized as ways of gener- 
ating essentially the same result as a frequentist analysis, but with a Bayesian 
interpretation. For example, the standard frequentist 95%-confidence interval 
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x± 1.96{nh)^^ has posterior probability exactly 95% under the formal posterior 
fl26|) . Similarly, the P- value P = Prob[A/'{0, {nh)~^)} > x] for testing the hy- 
pothesis Ho : M = against if i : M > can alternatively be interpreted as the 
posterior probability Prob(M < | x) under pUj) . 

Predictive distribution Combining the formal posterior distribution ()26|) for 
M I X with the sampling distribution A/'(M, h"^) for the posterior predictive 

distribution for Xn+i is | x ~ Af{x,{l + n~^)h~^}. A 95% "prediction 

interval" for is thus a;± 1.96(1 + r2~^)2/;,^2 . This agrees with the frequentist 
prediction interval, constructed using the property that, conditional on any value 
for M, Xn+i - X ~ 7V{0, (1 + n-^)h~^}. 

Bayes factor If (for finite n) we try to express great prior uncertainty by /iq i 0, 
we find that, for any fixed data, the Bayes factor Bio of fEHjl tends to 0. Such 
behaviour is clearly unsatisfactory. There is no non-controversial way of using 
improper priors for the comparison of hypotheses. 

4.2 Normal scale model 

Now suppose the mean is known, M = /i say, but the precision H is unknown. 
The likelihood function for H is 

L„(/i) oc (28) 

with s*^ the observed value of 5*^ := n~^'^^^^{Xi — /i)^, the mean squared 
deviation of the observations from their known population mean /i; this is an 
unbiased estimator of the variance V = H~^. The likelihood form fl28|) shows 
that we have an exponential family; it has natural parameter natural sufficient 
statistic ns*^, and mean-value parameter nV . 

Conjugacy suggests taking a prior for H of similar form to fj28p . This is achieved 
if we take 

~ r Qz/o, ^uori^ (29) 
(or, equivalently, H ~ X^q/^o'^o)' since this has density 

Ti{h) oc /il^«-ie-^^»"o'^ (30) 

(where the proportionality sign hides a purely numerical factor, not depending on 
h). This has expectation I/tq, so that the hyperparameter Tq is a "prior guess" 
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at V = H ^; and coefficient of variation 2z/o 2, so that the "degrees of freedom" 
hyperparameter z/q is a measure of the precision of this prior. 

Forming the posterior density by muhiplying (p?n|) and and simphfying, we 
find 

H\^^T(lur„lunT^] (31) 



with 



2 ' 2 



1^0 + n 



2 ^oTq + ""^ 

= 

In particular, the Gamma family of distributions for H is closed under sampling 
from Af{fi, H~^). We observe that the posterior precision hyperparameter is the 
sum of the prior precision hyperparameter and the sample size, while the posterior 
"guess" at is a weighted average of the prior guess and the data-based estimate, 
each weighted with its degrees of freedom. 



5 Computational methods 

Although the logic of Bayesian analysis is clearcut, as soon as we move away 
from simple conjugate analysis its computational implementation can be far from 
straightforward. We illustrate some useful techniques with a fairly simple exam- 
ple. 



Non-conjugate normal analysis. Consider the two-parameter model Xj ~ 
A/'(M, H), independently {i = 1, . . . ,n). A minimal sufficient statistic is (X, S^), 
where S"^ := J27=ii-^i ~ -^Y/ We assume a joint prior distribution of the 
form: 

M ~ Ar(mo,/io'), 

independently. This family of joint distributions is not conjugate: the joint pos- 
terior is not of the same form. 

Using results for the normal scale and normal location models separately, we see 
that, in the posterior, 

M I X /i ~ J\f ( ^o"^o + 1 \ 1^32^ 

' \ ho + nh ' ho + nh J 

^'""'^ ^ UoT^ + {n-l)s^ + n{x-fiy ^^^^ 
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However the marginal posterior distributions are non-standard. 

We can use (j33|l and (j32|l to compute the (joint) mode of the posterior 

distribution, since each component must be the mode of the conditional density, 
evaluated conditional on the other component. We obtain (for uq + n > 2): 

homo + nh*x 

i^oTq + {n — l)s'^ + n(x — fi*y ^'^^^ 

These combine to produce a cubic equation that can be solved for /i*. 

Alternatively, starting from a trial value we can cycle between fj34|) and fl35|) to 
update each estimate in turn: this is the method of iterative conditional modes. 
It will converge to a local mode of the joint posterior. 



5.1 Gibbs sampling 

Suppose (M, H) follow their correct joint posterior distribution 11. Given H = h, 
generate a new random variable M' from the conditional posterior distribution 
fj32|) : then clearly {M',H) ~ 11. Alternatively, given M = /i, generate H' from 
the conditional posterior distribution ()33|): now {M,H') ~ 11. Thus the joint 
posterior distribution is invariant under regeneration of either M of -ff in this 
way. 

This suggests a way of simulating from the joint distribution 11. First generate 
Ml somehow. Then, if Mi = /i, generate Hi from distribution (j33|l . Next, if 
Hi = h, generate M2 from (jH2I); and so on. This procedure defines a Markov 
transition kernel from (Mj, Hi) to (Mj+i, i?i+i), and the desired joint distribution 
n is a stationary distribution of the Markov chain. // the density vr of 11 is 
everywhere positive (and also under weaker conditions) then the Markov chain 
will be ergodic, 11 will be its unique stationary distribution, and the limiting 
distribution of (M„, Hn) will be 11, no matter how the chain is started off. Hence 
by running the chain for long enough we can ensure we have sampled from the 
desired joint distribution H. There are many delicate implementation issues, such 
as how long to run the chain to ensure a close approximation to the stationary 
distribution, and how to collect a large sample from that distribution — e.g., 
by running many parallel independent chains and taking their final values, or by 
running just one and taking suitable well-separated terms in the sequence. 

Once we have such a random sample, say (yUi, hi), . . . , (/iat, h^), we might esti- 
mate E{(7(M)}, say, by "^f^^ g{fii). However, we can do better by utihs- 
ing ()32|1 . which (with luck) allows us to calculate k{H) := E{g{M) \H}. Since 
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E{k{H)} = E{5f(M)}, we can alternatively estimate E{5((M)} by N'^ Y.t=i K^i)- 
Since var{5f(M)} = var{A;(if)}+E[var{5f(M) | i^}], we have var{A;(if )} < \ai{g{M)}, 
and we should get a more precise estimate by this procedure — which could 
in principle be iterated. Because of an obvious parallel, this is termed "Rao- 
Blackwellisation" . 

If we wanted an estimate of the marginal distribution of M, we might use the 
"empirical distribution" of the (/Zj), putting probability at each (assumed 
distinct), so that Prob(M G A) is estimated by the proportion of the (/ij) falling in 
the set A. However this discrete estimate of a typically continuous distribution 
is unsatisfying. A genuine (and better) density estimate is obtained by Rao- 
Blackwellisation: 

N 

7r(/i) = A^-^ ^7r(/i I hi). 

i=l 

This technique, Gibbs sampling, is just one of a number of Markov Chain Monte 
Carlo (MCMC) algorithms for sampling from a posterior distribution by simulat- 
ing from a Markov chain that has that as its equilibrium distribution. Such com- 
putationally intensive methods allow us to perform otherwise intractable Bayesian 
inferences, and their development and application forms a major strand of mod- 
ern Bayesian statistics. 



Chapter VIII 



Decision Theory 



Suppose we need to choose an action a G A, the consequence of which depends 
on the value 9 of 9, this being measured by a loss function L{9, a). We suppose 
L bounded below (we often take it to be non- negative): then its expectation over 
any randomness introduced into its arguments will be well-defined (possibly +00) 
and can be calculated as a repeated expectation in any order. 



1 Bayes act 

Suppose my current uncertainty about 6 is expressed by a Bayesian distribu- 
tion n over 6. If I take act a, my (subjective) expected loss is L{Il, a) := 
EQr^-!jL{Q,a). I can use this to rank the available acts. Any minimiser an of 
L(n,a) over A is called a Bayes act (with respect to 11); the minimised value 
if(n) := infagA -^(n, a) = L(n,an) is the Bayes loss under 11. 



1.1 Bayes estimate 

For the following examples we assume 6, A C R. We regard quoting an estimate 
of O as an act a. 



Squared error loss Suppose L{6,a) = [6 — a)^. Then the expected loss, 
for estimate a, is var(O) + {a — E(^)}^. Hence the Bayes estimate is just the 
expectation of 6, and the Bayes loss is the variance of G. 
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Absolute error loss Now consider L{6,a) = \a — 6\. Then the Bayes estimate 
is the median m of the distribution of B, i.e. that value (for simplicity assumed 
existent and unique) such that Prob(0 < m) = Prob(6 > m) = i. For by 
considering all possible orderings of 6, m, a, we see that 

L{6, a) — L{6, m) > {m — a) sign(^ — m) 

and so, since E{sign(0 — m)} = 0, E{L(0,a)} > E{L(G,m)} for any choice of a. 

As a variation, suppose L{6,a) = ci{a — 6) if 6 < a, and 02(6 — a) otherwise 
(ci,C2 > 0). Let q := €2/(01 + C2). Then the Bayes estimate is the q-quantile of 
the distribution of 6, i.e. that value 6q such that Prob(6 < 6q) = q. 

0—1 loss Suppose now L{6, a) = if \6 — a\ < k, else 1. The Bayes estimate is 
that value of a maximising the probability that 6 lies in the interval a ± k. For 
small k this will be, essentially, the mode of the density it (6). 

For the case of discrete 6, with loss L{9, a) = if a = 9, else 1, the Bayes estimate 
is the discrete mode, i.e. the value of i maximising Prob(9 = i). 

2 Decision rules 

Suppose now that, before having to choose an act, we can conduct an experiment 
S, observing X, with distribution governed by 0, and so learning more about 
G. A Bayesian would naturally apply the procedure above using the posterior 
distribution given the observed value x of X. However, a frequentist must proceed 
differently: by choosing between procedures for using any data that might be 
observed. 

Definition 1 A (non-randomised) decision rule is a map d : X — t- A. Its risk 
function is the function R{e, d) = Ee[L{6', d{X)}] of 9 e Q. □ 

For example, with A = 6 = R we might use squared error loss: L{9, a) = [9 — a)^. 
Then a non-randomised decision rule d : X — )■ [R is an estimator, and its risk 
function R{9,d) is just its mean-squared error function mse(6'), as considered in 
detail in Chapter lllll 

In the case of finite G, with points labelled 6*1, ... , 9n, we can think of the risk 
function of a decision rule d point {R{9i, d),..., R{9n, d))^ e R". 

A randomised decision rule d is formed by randomly choosing a decision rule 
according to a known probability distribution over the space of such rules (inde- 
pendently of both X and O). Its risk function is calculated by taking a further 
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expectation over this distribution. The set of all possibly randomised decision 
rules will be denoted by D, with D° its subset of non-randomised rules. 

An apparently more general concept is that of a behavioural decision rule. Such a 
rule S associates with each a; G X a probability distribution over A (independent 
of 0) — again, the risk is defined by an appropriate further expectation. How- 
ever, unlike the case for a randomised decision rule, the dependence between the 
outcomes of the randomisations for different x-values remains unspecified. But 
given any randomised decision rule d we can define a behavioural rule S, which 
associates with each x the distribution of d{x) (where d is random); and 6 and d 
will have the same risk function. 

Once a preferred decision rule d has been selected, and X = x observed, the 
corresponding (non-random or randomised) action a = d{x) is taken. 

The risk set is TZ := {R{-, d) : d & D} {a. subset of R" for finite 6). The technical 
reason for allowing randomised decision rules is that TZ is then convex: i.e., if 
x,y E 71 and < p < 1, then z := px + {1 — p)y G TZ. For if x = i?(-, di),y = 
R{-, d2), we have z = i?(-, d) where d is the randomised decision rule that chooses 
di with probability p, d2 with probability 1 — p. 

In the frequentist approach we compare decision rules in terms of their risk func- 
tions. We consider di at least as good as d2, and write di ^ d2, if R{0, di) < 
R{6,d2), all 6' G 6. In this case we may say that di weakly dominates d2. If 
moreover we have strict inequality for some 6, we write di -< d2, and say di dom- 
inates d2. But :< is only a partial order: typically R{6, di) < R{6, ^2) for some 
9 E B, R{9', di) > R{9', ^2) for some other 9' G 6, and then di and ^2 are simply 
incomparable. 

If d is dominated we call it inadmissible, else admissible. 



3 Convex loss 

We say that the decision problem is convex if the action space A can be repre- 
sented as a real interval and, for each 6* G 0, the loss function L(9, a) is a convex 
function of a on A. 

For example, the squared-error estimation problem, with G = A = [R and L{9, a) = 
(6* — a) ^, is convex. Any decision problem with only a finite number of available 
actions is non-convex. 



Extension to a convex subset of a general vector space is straightforward. 
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Theorem 1 In a convex decision problem, any behavioural decision rule 6 is 
weakly dominated by a non-randomised decision rule. 

Proof. For each x, let d{x) := E{6{x)}, where the expectation is under the 
distribution 6{x) over A. Then d{x) G A, so d is a non-randomised decision rule. 

Recall Jensen's inequality (see Chapter IIVI S I2.1|) : for any convex function g of 
a real random variable Y, and any distribution P for Y, g{E{Y)} < E{g{Y)}. 
For each x and 6, apply this to g{-) = L[9,-), under the distribution over A 
determined by 6{x), to obtain 



Taking a further expectation over X ~ yields R{9,d) < R{9,6), so that d 



Corollary 2 Under the above conditions, any randomised decision rule is weakly 
dominated by a non-randomised decision rule. 

Thus, in a convex decision problem, there is never any point in randomising. 

3.1 Sufficiency 

Let the statistic T, with values in T, be sufficient for G. Given any (say non- 
randomisedjl decision rule d, let 6 be the behavioural decision rule, based on 
T alone, that assigns, to t G T, the distribution over A of d{X) given T = t; 
this being determined by the distribution of X given T = t under Pg, which, by 
sufficiency, is in fact known, independently of the value 6. Then, conditional on 
T = t and = 6*, both d and 6 determine the identical distribution over the 
action space A; hence they have identical risk functions. Effectively, 6 mimics, 
by external randomisation, the known probabilistic way in which X arises fol- 
lowing observation of T. So any decision rule is weakly dominated by a (possibly 
randomised) decision rule based on T alone. 

The following result, generalizing Theorem El of Chapter IIIH now follows from 
Theorem ^ 

Theorem 3 (Generalized Rao-Blackwell Theorem) In a convex decision prob- 
lem, any decision rule is weakly dominated by a non-randomised decision rule 
based on the minimal sufficient statistic alone. 

^The argument is easily extended to handle the case that d is a behavioural or randomised 
decision rule. 



L{e,d{x)} < EL{e,5{x)}. 



(1) 



weakly dominates 8. 



□ 
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4 Selection criteria 

It would appear unwise to use an inadmissible rule. But typically there will be 
many admissible rules, all necessarily mutually incomparable. Other criteria are 
then needed to select between them. Here we consider two: Bayes and minimax. 

4.1 Bayes rules 

Definition 2 Let 11 be a distribution for O over 6. The Bayes risk of a decision 
rule d is 

r{U,d) := Ee^n{i?(e,rf)}. 

A decision rule d* is Bayes (with respect to 11) if it minimises r(n, d) over d G D. 

□ 

Bayesians do not need the added generality of randomised decision rules: 

Lemma 4 // there exists a Bayes rule, there exists a non-randomised Bayes rule. 

Proof. Supposed d* is a randomised Bayes rule. It arises by generating a 
random D from some distribution A over D'^, independently of 6 ~ 11. Then 
r{U,d*) = Ee^nR{&,d*) = E©^nED~Ai?(e, Z^) = E^^AEe~n/2(e, D) = Enr.Ar{Il, D). 
But, for all d, r(Il,d*) < r{U,d). If we had r(Il,d*) < r(Jl,d) for all d G D° we 
would have a contradiction. So {c? G D° : r{Il,d) = r{Il,d*)} is non-empty, and 
this is just the set of non-randomised Bayes rules. (In fact the distribution A 
defining d* must pick a member of this set with probability 1). □ 

Given a prior distribution 11 for G, we now have two Bayesian approaches to 
decision theory: 

Extensive form For any given data X = x, choose a Bayes act ax in the pos- 
terior distribution: 

ax = argmin E{L(6, a)\X = x}. (2) 

Normal form First choose a (non-randomised) Bayes rule du'- 

du = arg min Ee~n -R(0, d). (3) 

Then apply this to the observed data x. 
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Theorem 5 (Fundamental theorem of Bayesian decision theory) 

Normal and extensive form Bayesian analyses deliver the same result. 

Proof. Starting from an extensive form approach, we can construct the associ- 
ated non-randomised decision rule 6, such that 6{x) = ax- Then for any decision 
rule G D° 

r{U,d)-r{U,S) = E[E{L{e,d{X))-L{Q,S{X))\Q}] 
= E[E{L{e,d{X))-L{e,6{X))\X}] 
> (4) 

since E{L(6, d{x)) — L(9, S{x)) | X = x} > for all x. So the decision rule 6 con- 
tructed from an extensive form analysis is a normal form Bayes rule. Conversely, 
if dji is normal form Bayes we have 

E[E{L(e, du{X)) - L(e, 5{X))} \X]<0 

and since, by definition of 6, the integrand is non-negative, it must be (with 
probability 1). So duix) is a Bayes act in the posterior distribution given X = x. 

□ 



Corollary 6 The minimised Bayes risk r{Il,dn) is E{H{Ilx)} , where H{Ilx) is 
the Bayes loss for O ~ 11^., the posterior distribution given X = x. 

Since extensive form analysis requires minimisation only over the action space 
A, rather than over the much bigger space D° of non-randomised decision rules 
as required for normal form analysis, it gives a much easier way to construct a 
Bayes rule. 

Any rule that dominates a Bayes rule must be Bayes for the same prior. Conse- 
quently if a Bayes rule with respect to 11 is unique, it is admissible. 

More generally, a Bayes rule will be admissible under some additional conditions. 
Here we note the following weaker result: 

Theorem 7 Let du be a Bayes rule with respect to U, and d a decision rule that 
dominates du- Define A := {6 e B : R{e,d) < R{e,du). Then U{Q e A) = 0. 

Proof. Follows easily from R{e, d)-R{9, du) < and En{R{Q, d)-R{Q, dn)} > 
0. □ 



Corollary 8 Suppose 6 is discrete, and 11(0 = ^) > for each 6 E Q. Then the 
Bayes rule dn is admissible. 
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4.2 Minimax rules 

For those who don't hke priors, an "objective" (but arbitrary) criterion for se- 
lecting a decision rule is "minimax": 

Definition 3 A decision rule d* is said to be minimax if it minimises the worst- 
case risk: 

d* = arg inf < sup R(9, d) > . 

□ 

The minimax concept is fundamental to the theory of games against an intelligent 
and malicious opponent. In a statistical context, where the opponent is "Nature" , 
who chooses the value 9 of 9, it might be considered unduly pessimistic. 

In general a minimax rule may need to be randomised. And it need not be 
admissible, although any rule that dominates it must have the same worst-case 
risk. 



5 Inadmissibility of the multivariate normal mean 

Here is one of the more surprising results in decision theory: the obvous (minimax, 
maximum likelihood, best unbiased, best invariant,. . . ) estimator of a collection 
of 3 or more normal means is inadmissible] 

Theorem 9 (James-Stein) Suppose X = 6 = W , and the loss function is 
quadratic: L{6 — a) = ||6' — ap. Our model for X = {Xi) has independent normal 
components: X^ ~ A/'(6j, 1). Then the MLE ofQ is Q = X . However, forp > 2, 
is dominated by 

e^ = {l-a\\X\\-^)X (5) 

forO<a < 2{p-2). 

Proof. We easily compute R{9, 0) = p. 
Also 

i?(e,0«) = Ee 

^ fn.. .,,9 . XT^(X-e) ] 
= Ee^\\X-er-2a + ^ J 



x-e- 



aX 

\\x¥ 
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Now for any reasonable h, using integration by parts, 

Ee{{X,-ei)h{X)} = Eg{h[^{X)} 

where := dh{x)/dxi. (This is Stein's lemma). Take h{x) = Xi/||a;p, so 

= - 2x^/\\x\\^. Then 

j V II II j 

= (p-2)E,(||X||-2). 

Hence 

R{e,Q^) = p-a{2{p-2)-a}Ee{\\X\\-^). 

□ 



Each 6q (0 < a < 2(p — 2)) dominates the minimax estimator G so is itself 
minimax (with worst-case loss p). For any 6, R{6, 6q,) is minimised at a = p — 2, 
so 6p_2 dominates all other estimators in this class, which are thus inadmissible — 
including, for a = 0, the MLE Q = X. But even 6p_2 is inadmissible: it is 
dominated by {(1 — {p — 2)/||Xp}+X, where := max{z, 0} (and even this 
estimator is inadmissible. . . ) 

5.1 Hierarchical Bayes and Empirical Baye^ anal- 
ysis 

"Shrinkage estimators" similar to 0^ of jSJ arise in Bayosian analysis. 

Suppose that, in that example, the prior distribution for is specified in terms of an 
auxiliary "hyperparameter" V. Given V = i), ~ A^(0, vIp). We also need to assign some prior 
distribution to V: this yields a simple hierarchical Bayesian model, which allows us to learn 
from the data about the hyperparameter V of the first-stage prior distribution. 

The Bayes estimator is E{e | X) = E{E{e \V,X)\X} = E{V/{1 + V)\X} X. We thus 
need the posterior expectation of W := V/{1 + V). But conditional on V alone, we have 
X ~ M{0, (1 + V)Ip} = Af{0, (1 - W)~^Ip}. A sufficient statistic for W in these distributions 
is ll^lP, with distribution (1 — W)~^Xp- We could now find the posterior distribution of W 
given \\x\\^ by combining this model with whatever prior is used for W. 

In practice this "complcat Bayesian" analysis can be demanding. A short-cut ('^empirical 

Bayes") is to replace the posterior expectation of W given with some point estimate of W 

based on Since E(||X|p | W) = p{l — W)~^, one possible estimate is w := (1 — p/||a;|p)+. 

Using this produces the estimator 0p. Alternatively, note that, for p > 2, £(1/11X11^ \ W) = 

(1 — W)/{p— 2), suggesting the estimate to := {1 — (p — 2)/||a;|p}+: this yields the positive-part 

James-Stein estimator ©it n. 

o 

The original meaning of this term, introduced by Herbert Robbins, has now been largely discarded in favour 
of the usage described here 
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6 Two simple hypotheses 

The simplest decision problem is when both parameter- and action-space have 
just two elements: say 6 = {^o, ^i}, ^ = {o-o, oi}. The loss function is L{9i, aj) = 
hj, where we suppose /qo < ^oi, ^ii < ^lo, so that the optimal act when is 
known to take value 6i is {i = 0, 1). Although this formulation and most of 
the analysis below treats 6*0 and 6i entirely symmetrically, in some approaches we 
distinguish between the null hypothesis Hq : Q = 6q and the alternative hypothesis 
Hi : Q = $1, and treat these differently. Then ai is interpreted as "reject Hq" 
and ao as "accept Hq" (or, at any rate, "do not reject Hq"). 

We do not restrict the sample-space X. The density (with respect to a suitable 
underlying measure /x) of X given 6 = is denoted by pj(-). The likelihood ratio 



A behavioral decision rule is defined by a function : X — ?■ [0, 1], where (f){x) is 
the probability of taking action Oi after observing X = x. Its risk function is 
given by 



where a := E{0(X) | ^o} is the overall probability of taking action ai given B = 
^0, the type I error of 0; and /3 := 1 — E{0(X) | 6i} is the overall probability of 
taking action qq given Q = 6i, the type H error. 

We also term a the size of the test 0, and 7 := 1 — /3 its power. Thus 0^0' <^==^ 
a < a' and (3 < (3'. 

For a non-randomised rule takes values and 1 only. It is equivalently deter- 
mined by its critical region. C = {x : 0(x) = 1}. Then a = Prob(X G C | 6 = 
^0), /3 = Prob(X G C I 6 = ^i) (where C denotes the complement X \ C of C in 
X). 

6.1 The risk set 

The risk set is {{R{6q, 0), R{9i, 0)) : G D}. From equations dH}, this is a simple 
transformation of the set S of error-pairs {(a,/?) : G D}, a subset of the unit 
square. So without any real loss of generality, we may assume = if i = j, 1 
otherwise, when the risk-set is just S. 



isA:=X{X) ■.= Pi{X)/pq{X). 



R{9q,<P) 




(6) 



Since we are allowing randomisation, 5* is convex; further if {a, /3) G 5*, corre- 
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spending to (j), then (1 — a, 1 — /3) G S, corresponding to 1 — Moreover S is 
closed. Any set 5* C [0, 1] x [0, 1] with these properties can serve as a risk-set for 
some experiment. An example is shown in Figure^ 




0.6 - 

0.5 - 

0.4 - 

0.3 - 

0.2 - 

0.1 - 

0- 

0.1 0.2 0.3 0.4 0.5 0.6 0.7 0.8 0.9 1.0 

Figure 1: Risk set for testing simple hypotheses 

The admissible rules are those corresponding to points on the lower boundary of 
S. 

6.2 Bayes rules 

Let now have prior distribution 11, where 11(0 = 6i) = Hi {i = 0, 1). The Bayes 
risk of (j) is then 

r(n,0) =7roa + 7ri/3. (7) 

Finding a Bayes rule is thus equivalent to finding the rule that minimises a 
linear combination of the two error rates with non-negative coefficients. From 
Corollary |H1 of Chapter IVllll when neither coefficient is such a rule will be 
admissiblejj In terms of the risk set, it will correspond to a point found by 

* Admissibility need not hold if we allow zero coefficients. Thus the rule </> = 0, corresponding 
to the point (0, f ) G 5*, is Bayes for prior tti = f , ttq = 0; but it will be dominated if there exists 
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sliding down a line of slope — tto/tti until it is just about to leave S: see Figure EI 
Its intersection with S is then non-empty and completely contained within the 
lower boundary of S; any rule corresponding to a point in this intersection will 
be a Bayes rule with respect to 11 




Figure 2: Bayes rules are admissible 

The figure suggests that any point on the lower boundary can be characterised 
in this way and is thus a Bayes rule: this is indeed the case, and can be proved 
using the supporting hyperplane theorem for convex sets. We end up with a (1, 1) 
correspondence between admissible rules and Bayes rules so long as both error 
probabilities are positive. 

Finding Bayes rules 

Finding a Bayes rule — equivalently, minimising a linear combination or the error 
rates — is really quite easy. 



any other, non-trivial, rule with a ~ 0. 
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Theorem 10 Let k > 0. A rule (j) minimises ka + (5 if and only if: 

Pi{x) > kpo{x) =^ (f){x) = 1 
Pi{x) < kpo{x) =^ (f){x) = 0. 

With k interpreted as tto/tti, this resuh is just a special case of Theorem 
since we can interpret the required minimiser as solving a normal form Bayes 
analysis, and (jH)) as the extensive form solution to that problem. (Note that this 
corresponds to choosing the act with the higher posterior probability.) 

We call a test of the form (jH)) a likelihood ratio test, with cut-off k, LRT(/c). 
When po{x) > on X, (jH)) is equivalent to: 

X{x) > k ^ (j){x) = 1 
X{x) < k ^ (j){x) = 0. 

When {x : pi{x) = kpQ{x)} has measure 0, an LRT(/c) is essentially unique. 

Note that, since (0, 1) G S* (it corresponds to = 0), a LRT(O) must have f3 = 0. 
We can similarly allow /c = oo: a LRT(oo) has a = 0. 

Corollary 11 Any LRT(k) with < k < oo is admissible. 
6.3 Ney man-Pear son approach 

The following simple Corollary (which was proved directly in STATISTICS IB) 
to the Bayesian result of Theorem ITUl is the basis for the Neyman- Pearson (N-P) 
frequentist approach to hypothesis testing. 

Theorem 12 (Neyman-Pearson lemma) Let (j) be a likelihood ratio test, with 
associated error probabilities {a, /3) and power 7 = 1-/3; and let be any test. 
Then 

a' < a ^ 7' < 7. 

This result can be expressed as: Any LRT is most powerful for its size. Note that 
a most powerful test need not be admissible: we might have a' < a and 7' = 7. 
This can happen for a LRT(O) (with 7 = 1). 

The N-P approach is to preselect the desired size a (e.g. at 5%), and adjust the 
cut-off k to yield a LRT of this size — which will then have maximum power 7 
subject to size < a. Note that this introduces an asymmetry into the treatment 
of the two hypotheses. 





Chapter IX 



Multivariate Analysis 



1 Linear multivariate analysis 

Multivariate analysis deals with relationships between several variables, consid- 
ered on an equal footing. 

Let Vi,...,Vp be the basic variables of interest — for example, Vi = H := 
"log(height in meters)", V2 = W := "log(weight in kg)", .... (In this course 
we are only going to deal in linear operations, so if we also wanted to include 
e.g. untransformed height we would need to introduce it as a new variable). The 
string of basic variables {Vi, . . . , Vp) will be regarded as a (1 x p) row- vector V_. 

We can construct new variables by linear operations, e.g. B := W — 2H is the 
variable "log(body mass index)". We shall consider all such linearly constructed 
variables, of the form W = V_sl, where a is a (p x 1) column vector, as essentially 
on a par with the basic ones. We obtain a ]?- dimensional vector space V of 
variables, the basic variables Vi, . . . ,Vp forming a basis of V. 

An individual will have a value Xj — considered as random — for any basic 
variable Vj. These will be strung into a, [1 x p) random row- vector X, with some 
multivariate distribution P, say. The value for any variable W = V_Si G V is thus 
= which will have an induced univariate distribution Pw- 

Lemma 1 The joint distribution Py is determined by its collection of univariate 
marginals {Pw ■ W & V}^ 

Proof. P is determined by its multivariate characteristic function (pp : IR^ — )■ C; 

^Note: It is not enough to consider only the basic variables (Vj). 
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but 4>p{sl) = Epfexp i Xa) is determined by the distribution Pw of Xw = Xsl. 

□ 



Let fij = E(Xj). The expectation of X. is the vector fi = (/ii, . . . Then for 

W = VjaL, fia is the expectation of X^; and clearly fi is fully determined by all 
such univariate expectations. 

Let (Tjfc = cov{Xj, Xk) = E{(Xj — fij){Xk — /ifc)}. In particular, ajj = var(Xj). 
The {p X p) matrix S = {ajk) is the dispersion or covariance matrix of X. Thus 
S = cov(X^,X) := E{(X - /i)}. For W = Vsl, vaT{Xw) = a^Sa. 
Since this must be non-negative for any a G R^, S must be (symmetric and) 
non-negative definite. It is positive definite if and only if, whenever a 7^ 0, the 
univariate marginal Pw ofW = Vsl is non-trivial {i.e., not a 1-point distribution). 
In this case K := is termed the precision matrix (or concentration matrix) 
of X- 

In accordance with Lemma Q we can recover S from the univariate marginals: 
iajk = var(Xj + Xk) - var(Xj - Xk). 

Let Y_ = XA, Z = XB, with A {p x q), B (p x r), be the values for new sets of 
variables ]/^, V_B. Then cov(r^, Z) = com\a^X^,XB) = A^SE. In particular 
the dispersion matrix of F = is A^EA. 

The correlation between Xj and is 

cov(Xj,Xfc) _ ajk 



pjk = corr(Xj,Xfc) : = 



^var(Xj) var(Xfc) y/o^j~o\k 
By Cauchy-Schwarz, any correlation lies in [—1, 1]. 



2 Multivariate Normal Distribution 

Definition 1 We say that X has a multivariate normal (MVN) distribution if, 
for any constructed variable W = ]/a, its measurement Xy/ = Xa has a univari- 
ate normal distribution. □ 

It immediately follows that, if X (1 x p) is MVN, then so is any linear transfor- 
mation Y_ = XA, for fixed A {p x q). 

Let the mean- vector and dispersion matrix of X be and S. Then the distribu- 
tion of Xw must be the univariate normal distribution A/'(/ia, a^Sa). We thus 
write 

X~Arp(^,S) 
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if, for any a G R^, 

Xa~ Ar(^a,a^Sa). (1) 
The following is then immediate: 

Lemma 2 // X ~ J\fp(n,T.), and Y_ = X^A for fixed A (p x q), then Y_ ~ 
Mq{^A,A^T.A). 

By Lemma there can be at most one distribution satisfying (^J for all a G IR^. 
Indeed, because the characteristic function of A/'(m, v) is given by t H- exp^imt — 
^vt"^), that of A^(/i, S) (if such a distribution exists) must be given by 

0(a) = exp (ifia a^Sa j . (2) 



We note that, since a linear combination of independent normal variables is nor- 
mal, if Z = (Zi, . . . , Zp) with the [Zj) independent and identically distributed 
Af{0, 1), then Z_ ~ Mp{0,lp). Thus A/'p(0,/p) exists. Now let G R^, and let E 
be an arbitrary symmetric non-negative definite {p x p) matrix. We can write 
E = A'^A for some A [p x p). Defining X = /i -|- Z_A, where Z_ ~ A/^(0, Ip), it is 
readily seen that 2L ~ A/^(/^7 S). Thus for any (1 x p) mean vector /i and {p x p) 
dispersion matrix E, a p-variate MVN distribution with these attributes exists, 
and is fully determined by them. 



2.1 Density 

Because it has independent and identically distributed M{0, 1) entries, the joint 
density for Z_ ~ A/^(0, Ip) is readily computed: 

PzU) = 7z\t- exp —zz^. (3) 

Now let X ~ N'p^fi,!!), with E positive definite. We can write E = A'^A with 
A {p X p) non-singular; then X = /i -|- Z_A, where Z = (X — fi)A'~^ ~ A^(Q, Ip). 
The Jacobian of the transformation x^zis (dxj/dzk) = A"^, with determinant 
(det E)^. Applying the multivariate change-of- variables formula to Q, we obtain 
the density of Afp{iJ,, E): 

Pxix) = --, rexp — (x — n)T.~^(x — ii)^ 

■ exp — -trE~^(x — /i)"'"(x — /i). (4) 



(27r)2P(detE)2 2 
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What if S is only positive scmi-dcfinitc (and hence singular)? In that case there exists a 
with Ea = 0, and so var(Xa) = 0. So the distribution of X_ is confined to the affine subspace 
{x : (x — fj,)a = 0}, and so can not have a density with respect to Lebesgue measure on 
RP. Nevertheless, we can still have a perfectly satisfactory MVN distribution according to our 
definition. 



2.2 Bivariate case 

When p = 2 we will usually write X. = i^,'^), = if^Xyf^'v), = o-xx = Cx, 
(722 = o-yy = CTy, and pi2 = p. 

We can consider the joint distribution as made up from the marginal distribution 
of X, and the conditional distribution of Y, given X: 

Lemmas Let (X, F) ~ ^(/i, S). Then 

X ~ Afifix,crxx) (5) 
Y\X ~ J\f{a + l3X,ayy.x) (6) 



where 



/3 = CTxr/crxx (7) 
a = fly - (3 fix (8) 



(yyy-x = (^YY • (y) 

o'xx 

Proof. Property (jS} is immediate from the definition of multivariate (here 
bivariate) normality. 

Let R = Y — a — f3X. We readily compute E(i?) = 0, var(i?) = cryy.x, 
cov{X,R) = 0. In particular, i?_LLX so that, given X, R Af{0,ayy.x)- Prop- 
erty © follows. □ 



It can be checked that (Jyy.x = kyy, the corresponding entry of the precision 
matrix K = T,~^ (when this exists). 

The function (of x) E{Y \ X = x) = a + (5x is the regression (line) of Y on X; 
l3 is the regression coefficient. The variableU Y * X := I_{Y \ X) = a + j3X is 
the linear predictor of Y based on X. The differenc^ Y ■ X := Y — Y * X 
(= R) is the residual, and its variance ayy.x is the residual variance, of Y (after 
allowing/adjusting for X). 



^This notation is non-standard! 
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More generally, for vectors X and F , we have linear predictor 

with TiXY = covfX"*", Y), etc.; and residual vector Y-X = Y — Y*X. 
2.3 Partial covariance and correlation 

Now suppose p = 3, X = (X, Y, Z) etc. We can adjust both Y and Z for 
X, as above, yielding residuals Y ■ X, Z ■ X. We compute cov(y ■ X, Z • X) = 
o'yz-x '■= cfyz — ctxyO'xz / o'xxi the residual or partial covariance between Y 
and Z, after allowing for X. Correspondingly the residual /partial correlation is 

PYZ-X = O-YZ-x/ y/C^YY-XCTZZX- 

The vector extension (when S^^^ exists) is 

Syz-X = C0V(F ■ X""", Z_ ■ X) = Sy^ — EyxSj^^SjfZ- 

In particular, the residual/partial dispersion matrix of F, allowing for X, is 
Syy.x = ^YY — ^YX^x\:^XY- Moreover, if S is the dispersion matrix of (X, Y), 
and K = exists, then T^yy-x = -^yy- 



3 Data 

In applications we will have measured the value of each variable Vj on each of 
a collection of individuals ^i, . . . ,^n- Let Xij be the measured value for variable 
Vj on individual ^j. The (n x p) data matrix is X = Its ith row Xj = 

{xii, . . . , Xip) contains the measurements, for all variables, on individual ^f, while 
its jth column, Vj, contains all the data on variable Vj, across individuals; more 
generally, all the data on variable W = Vsl are contained in the (n x 1) vector 
w = Xa. 

The average data string, across individuals, is W := XlILi ~ n'^l'^X, where 
1 is the (n x 1) vector of Is. Correspondingly the average value for variable 
W = V_a is tZJ = xa = ra'^l^Xa. 

Let H := n~^J, with J = 11^ the (n x n) matrix of Is. Then every row of X := 
HX is X. The [p x p) matrix if is a (rank 1) projection matrix: H = = H^. 

■^We have so far tried to use upper case symbols for variables, and lower-case symbols for 
their values. In dealing with matrices, however, it is common to use upper case for a matrix, 
and lower case for its entries. A notational conflict arises when we deal with random matrices; 
we will not be able to resolve it consistently. 
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Let H .= I — H . Again, U is a. {n x n) orthogonal projection matrix, this time of 
rank n - 1; and HU = UH = 0. 

The (mean) -corrected data matrix is X'^ := IIX, and has zth row — x_. The 
mean-corrected data on any variable W = V_sl are contained in the (n x 1) vector 
vf = X'^a = UXa. We see that linear operations on variables commute with 
linear operations on individuals. 

The {p X p) (corrected) sum-of-squares-and-products (SSP) matrix is 

■= {X^fX^ = X^UX. (10) 
Then a'^5''^a is the corrected sum of squares Y17=ii'^i — w)"^ for variable W = Va. 

3.1 Random sampling 

If the individuals can be regarded as randomly sampled from some 

population, then we can treat the n rows of X as independent and identically 
distributed random vectors 2£.ii ■ ■ ■ , 2£.n from some p-variate distribution P. Cor- 
respondingly each entry Xij of X is random, with independence across different 
i (but not necessarily across j). If the mean- vector and dispersion matrix of P 
are /i and S, then E(Xjj | /i, S) = fij, cov(Xjj, Xj/j/ \ fi,T.) = 6^ ajj'. 

Consider any variable W = Vsl. Its vector of values {Wi, . . . , VFri)"'" = are 
independent and identically distributed, each with expectation fiw = A^a and 
variance = a"'"Ea. From univariate theory, we know that unbiased estimators 
of i^w and are W = Xa and (n - 1)-^ - = {n- ly^aJS'^a. It 

follows that we have multivariate unbiased estimators: 

E(Z) = t (11) 
E{(n-l)-'5^} = S. (12) 

If yU is known, without loss of generality we take /i = 0: then an unbiased estimator 
of E is rT^S where S := ^ixjx_^ = X'^X is the uncorrected sum-of-squares-and- 
products matrix. 



4 Normal dispersion model 

Suppose now that the rows ( X ^ : i = 1, . . . ,n) of the data matrix X are inde- 
pendent and identically distributed from the normal distribution A/'(0, S) with 
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known mean- vector (taken as without loss of generality), so that the only un- 
known parameter is the dispersion matrix E. In particular, the entries (Xij) are 
jointly normal, with E(Xjj) = 0, cov(Xy, Xj/j/) = 6ii> ajj/. 

We shall suppose S non-singular. From (j3)), the overall joint density is 



p(X\E) = ^ —exp--tTj:^^S (13) 

(27r)>(detS)i" 2 

— exp - ^ I > ^ kiiSii + 2y ] kijSij 1 (14) 



(27r)5"P(detS)^ 



where K := J] ^ is the precision matrix. 



We see that (fT^ is of exponential family form. We can take S (more precisely, its 
upper triangle) as the natural sufficient statistic; then the mean- value parameter 
is nS, and the natural parameter is K. The likelihood equation 

equates the natural statistic and the mean-value parameter, yielding MLE S = 
n~^S, which is unbiased. 



4.1 Wishart distribution 

The distribution of S" = X'^X in the above model is called the Wishart distribu- 
tion. It depends on E and n, and we write S ~ Wp{n; E). The parameter E is 
the scale parameter, and n is the degrees of freedom. When E = /p we obtain 
the standard Wishart distribution on n degrees of freedom. 

From univariate theory we have Wi{n:, (p) = (pXn- 

The following result is immediate from Lemma El 

Lemma 4 Let S ~ Wp{n] E), and A [p x q) be a fixed matrix. Then A'^SA ~ 



Corollary 5 Suppose S ~ W{n] E). Then for fixed a G R^, 

Si Ssi 2 

aTEa 

*To be precise: the values {—■^ka : I < i < p}, {—hj : 1 < i < j < p}- Alternatively we can 
take K itself as the natural parameter; the associated natural sufficient statistic then requires 
compensating adjustments to the entries of S. 
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Corollary 6 Su/au ~ xl- 

Lemma 7 Let S be (p + q) x (p + q), with the first p rows/columns corresponding 
to variables Z, and the last q rows/columns corresponding to variables Y_. If 
S ~ Wp+q{n] S) {n > p), then Syy-z ■= Syy - SyzSz\Szy ~ Wq{n-p, T^yy-z)- 

(Proof relegated to Example Sheet.) 

Corollary 8 Suppose S ~ Wp{n; S). Then for fixed a G R'^, 

^T^-l^ ~ Xn-p+1- 

Proof. Consider first the case a = e := (0, . . . , 0, 1)""". Then a"'"S'~^a is the 
{p,p) entry of S~^, and its inverse is Syy-z, where Z_ corresponds to the first 
p — 1 variables and Y to the pth. Similarly for E. From Lemma |71the result holds 
in this case. 

For the general case, let S* := H^SH where H is non-singular with if^a = e. 
Then S* ~ Wpin; E*) with E* = HT.H"^ , and the {p,p) entry of (5*)"^ is a^^-^a 
(and similarly for E). □ 



Wishart density 

When we speak of the density of 5, this is to be interpreted as the joint density 
of its p{p + l)/2 mathematically independent entries (sjj : i < j), with respect to 
Lebesgue measure over the set of values for which S is non-negative definite. 

Let S ~ Wp{n] E). When n < p, S is singular, having rank n with probability 1, 
so does not have a density. 

For n > p, S is almost surely of full rank p, and its density then exists and is 
given by: 

Ps{S\E) = C(p,n)(detE)-5"(det^)i("-P-^)exp-^trE-^5. (15) 
The normalisation constant is given by 

C(p,n)-^ = 2>7rf(P-i)/^ JJr|^(n + i-l))|. (16) 

In particular, the density of the standard Wishart distribution W{n\ Ip) is 

P5(5|/p) = C(p,n)(det5)^("-f-i)exp-itr5. (17) 
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Reduction to standard form 



The following argument shows how, if we can establish fll7|l . we can easily deduce 

m- 

Since is a sufficient statistic for the family of densities fllH|l . for any data 
the likelihood based on S is proportional to that based on the full data X (see 
Corollary El of Chapter . So 

Ps{S\Ip) Px{X\Ip) ^ ' 

^ exp-l (tiJ:'^S -tiS) . (19) 



(detS)*" ^ 2V 
Substituting (fT7|) for the left-hand side denominator, we obtain ()15|) 



Bivariate case 

We here derive 1171 for the case p = 2. The ith row of the data matrix X will be written as 
iXi,Yi). 

We have U := Sxx ~ xl by CorollaryEl We now argue conditionally on the X's (so these 

can be regarded as fixed), and consider the zero-intercept sample regression of y on X, estimated 

from the data pairs (Xi ,Yi), . . . , {X„ ,Y„). From standard regression theory, the estimated 

regression coefficient is /3 := SxY / Sxx > and the unbiased estimator of the residual variance of 

Y, about its regression line, is := SyY-x/in — 1), with Syy x '■= Syy — S^y/^xx- Since, 

with S = 72 , the population counterparts of these quantities are and 1 , regression distribution 

theory tells us that (still conditional on the X's) and are independent, with respective 

-1 - - 

distributions ^(0,3-^^) and xS^-i/C" ~ !)• Hence if we define V := S^-^P = SxY / S^-^, 

W := Syy x, we have V ~ A/'(0, 1), W ~ Xn— l' independently. Moreover, since this joint 

distribution of {V, W) holds conditionally on the X's, but does not in fact depend on their 

values, while U := Sxx is entirely determined by those values, we must have {V, W) AL U ; and 

we know U ~ Xn- '^^^ joint density of {U, V, W) is thus: 

p,U,v,w)iu,v,^) = M^e-i-J-^ ■ -^e-h' ■ )l> e-i-^^("-^)"^ (20) 

Now change variables to Sxx = U, SxY = uiv, Syy = W + V^. The Jacobian of the 
transformation is J := det{d{sxx , sxY, SYY)/d{u, v, w)} = « 2 , so the transformed density is 

ps{S) ■.= ps{sxx,sxy,syy) = P{U,V,W){^,^,^)/J 

= C(det5)5("-=') exp-itr5 (21) 

where 

C-=2'^v^r(in)r(in4). (22) 



4.2 Sample correlation 

The sample correlation coefficient based on S is r.^ := Sij/ ^jSaSjj. When S ~ 
Wp{n\Y?), ri2 depends only on the leading (2 x 2) submatrix 5*2 of S, which 
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has distribution W{n] S2), S2 being the leading (2 x 2) submatrix of S. So to 
investigate the distribution of a sample correlation coefficient r we can restrict 
to the case p = 2, r = ri2. Let the ith row of X be {Xi,Yi). Then r = 
'S'xy/ V SxxSyy- 

Note that the value, and hence the distribution, of r is unchanged if we first trans- 
form to X* := Xi/^/oxx, Y* := Fj/y^ovy- This has the effect of transforming E 
to 

' 1 P 
P 1 

where p is the population correlation between X and Y. Hence the distribution 
of r depends only on p (and n); and to investigate it we may as well take S to 
have the form in 



(23) 



Derivation of the distribution of r is in principle straightforward, starting from 

the joint density fll5|) for {sxx, Sxy, Syy) (for n > 2), making a multivariate 

1 1 

change of variables sx = s^^x; — ^yy^ ''^ — ^xy/{sxSy), and integrating out 
sx and sy. But the integral can not be done in closed form. The final result can 
be expressed in various ways, one of which is 

p(r |n;p) = (^^^^^ (1 -p2)l"(i _r2)|("-3)^ (cosh u - pr^du. (24) 



We shall derive ()24jl for the special case p = 0, when we can take S = J2, so 
that the Xs and Ys are independent normal with unit variance. (In this case the 
otherwise problematic integral term in fl24|l is just a constant.) 



Introduce 



Then 



1 



t:={n-l)^-^. (25) 



Sxy/ Sxx 
^ySvY-x/in - 1] 



1 



^xx 



where (3 = Sxy /Sxx is the sample (zero-intercept) regression coefficient of Y on 
X, P = is the corresponding population value, and is the sample residual 
mean square (on n — 1 degrees of freedom) for Y about its sample regression 
on X. We recognise t as Student's t-statistic for testing the (true) hypothesis 
(3 = 0. Consequently, from standard regression theory, conditionally on the X's 
and hence also unconditionally, 

t^tn-l. (26) 
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Using the known form of the density for tn-i, 

(27) 

and applying the change-of- variables formula with (j2Sl), delivers the density of r 
on (-1,1): 

pir I n; 0) = ^ (1 - r^t^^-'\ (28) 

r(2^- 2)0^ 

But to test the null hypothesis p = it is more straightforward first to trans- 
form the observed value of r to t using fl25|l . and refer this to tables of the 
distribution. 

It would be nice if an argument similar to that based on 1181 could be used to transfer the "null" 
sampling distribution 1281 for r to the case p ^ 0, but unfortunately this does not work because 
r is not sufficient for the model with S of the form I23i . 

Asymptotic distribution 

It can be shown (see e.g. Anderson, 2003, §4.2.3) that, as n — )■ oo, 

v/^(r-p) A A/- {0,(1- pY}- (29) 

Now consider z := tanh~^ r = |log{(l + r)/(l — r)}, ( := tanh~^p. Since, for 
large n, r is close to p, and d{tanh.~^ r) / dr = 1/(1 — r^), we can approximate 
2 — C ~ {1/(1 — P^)}(^ ~ P)y so that z has asymptotic expectation ( and constant 
asymptotic variance 1/n, and we have 

v^(z-C) AAr(0,l). (30) 

Formula ()3()|1 yields a better approximation than ()29|1 . and can be used as the basis 
of approximate inference about (, and thereby about p = (e'' — e~'^)/{e'' + e"''). 

Sample partial correlation 

Sample partial correlations can be constructed from S in the same way as popu- 
lation partial correlations are constructed from S. On account of Lemma the 
distribution theory (both exact and asymptotic) is exactly as above, but with the 
degrees of freedom decremented by p, the number of variables adjusted for. 
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5 Normal location-dispersion model 

More realistically, we now consider the case of unknown mean-vector M as well 
as unknown dispersion matrix E (assumed non-singular). From the overall 
joint density is then 

1 1 " 

- '""'{-^"i^^"'i^''exp|„^S-'r-ltrE-4. (31) 



(27r)2«P(detS)2" I - 2 

Again ()3H1 is of exponential family form. The natural sufficient statistic can 
be taken as (^,5"). Since E(Xj) = /i and E(^7^i) = S -|- fi^fi, the mean- 
value parameter is -|- iJt^jJt)). Solving the likehhood equations, obtained 
by equating the natural sufficient statistic with the mean-value parameter, yields 
maximum likelihood estimates: 

^ = 1 (32) 
E = n-^S". (33) 

However, the biased estimator n'^^S^ is usually replaced with the unbiased esti- 
mator in — 1)~^S'^. 



5.1 Sampling distribution of sufficient statistic 

The statistic (X, S'^) is a (1,1) function of the natural sufficient statistic (X, S), 
hence minimal sufficient (although not itself a natural sufficient statistic). 

Lemma 9 X_ and S'^ are independent. 

Proof. The nx n matrix of covariances between the jth and kth columns of X 
is (JjkIn- Hence the vector of covariances between the jth column of X'^ = UX 
and the kth entry of X = n~^l^X is n~^H (Tjfc/„ 1 = since HI = 0. Thus 
every entry of X'^ is uncorrelated with every entry of X. Since the entries of X 
are jointly normally distributed, and we are dealing with linear transformations, 
zero correlation implies independence: X_LLX^. Hence, since S"^ = (X^)"'"X^, 
XALS^. □ 



The distribution of X is readily found: 
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X~Ar,(^,S/n). (34) 
To find the distribution of S'^ we argue as follows. 

Construct an orthonormal basis ai, . . . ,a„ of with ai = n^^l. Let A (n x 
{n — 1)) have columns a2, . . . , a„. It is easily checked that A'^A = while 
AA^ = n. Let X* := A^X, so that X* is (n-l) xp and = {X*fX*. Because 
E(X) = 1/i and A^l = 0, X* has expectation 0. Moreover it can be checked 
that the n — 1 rows of X* all have dispersion S, and are uncorrelated — hence 
(with normality) independent. It now follows from the definition of the Wishart 
distribution that 

~ Wp{u; E) (35) 

with u = n — 1. 

Thus if we base inference about S on S'^, we can use all the results found for the 
normal dispersion model, after first replacing n by the new degrees of freedom 
throughout. 



5.2 Inference for mean- vector 

In the univariate case we can base inference for the single mean M on the Student's 
t pivot: 

t:=n^X-M)/si^, (36) 

where Sxx '■= is the usual unbiased estimator of the variance 

of X, on z/ = n — 1 degrees of freedom. Then t has the t^, distribution, defined as 
that of Z/ \fvjv when Z ~ A/'(0, 1) and V ~ x^? independently. Alternatively we 
can use F := = n{X — ViY / S xx-, which has the F distribution F^}, where F^ is 
defined as the distribution of {U j "p) j iy j v) when U ~ x^, ~ independently. 

The generalisation of to the multivariate case is Hotelling's T^: 



Hotelling's 

Lemma 10 Suppose X_ ~ J^p{0,lp) and S ~ Wplu; Ip) {y > p), independently. 
Define S := S/u, := XS'^Xj. Then (^^) ~ 

Proof. Let U := XX^, V := uU/T^ Then U ~ Xp- Also, by Corollary 
conditionally on X, ^ ~ xl-p+i- Since this conditional distribution does not 
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depend on X, in fact V is independent of X, and therefore of U. We thus have 

= ( ^ ^/P 

\u-p+lj V/{u-p+l) 

with U ~ ^ ~ X^-p+i independently; and so, from the definition of the F 
distribution, iU/p)/{V/ii^ - p + 1)} ~ F^.p+i- □ 

The (scaled F) distribution of T^ is termed Hotelling's T"^- distribution with pa- 
rameters p and I/, T^(p, v). 

Corollary 11 Let X. ~ A/^(0, E), 5* ~ Wp(i/; S) (u > p), independently, where 
S is positive definite. Then XS X ^ ~ ^^(p? 

Proof. Let A be non-singular with AA^ = Define X* := XA, S* := 

A'^SA. Then X* ~ Afp{0,lp), S* ~ Wp{u;Ip), and X;S"^X^ = X*(S^)-^X*^. 

□ 

Returning to our location-dispersion model, define 

T^ ■.= n{X-M)(S^)~\X-Mf (37) 
where S'^ = S'^/ {n — 1). This is one-sample Hotelling's T^. 

Since, given the parameters (M, S), n5(X — M) ~ A/'p(0, S) and is independent 
of S'^ ~ Wp{n — 1; S), we immediately deduce that ~ T^{p] n — 1). Since this 
distribution does not depend on the values of (M, S), is a pivotal quantity. 

We can thus test the hypotheses M = A* (with S unspecified) by referring n(x — 
fJ'){S'^)~^(x_ — /i)^ to tables of the T^{p; n — 1) distribution; equivalently, we refer 
{{n — p)/p} n(x_ — fi){S'^)~^(x_ — fi)"^ to tables of A (1 — a)-level confidence 

region for M can be formed as 

j/x: (^^) nix-^^)iST\E-^ir<F^,_p{a)^ 

where F^_p{a) is the upper a-point of the F^_p distribution {i.e., Prob{F^_p > 
-F^_p(a)} = a). This will be an ellipsoid centred on x_, with principal axes aligned 
with those of S^. 

Two-sample test 

Suppose wc have independent (p X 1) data-vectors ~ A^(M S) {i = l,...,njf), Y-i ~ 
M{M-Y' ^) {* = li • ■ • I "f^v)^ from two populations with the same dispersion matrix but possibly 
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different mean- vectors. The sufficient statistic is (X, Y, Sw) with Sm := S'^ which are mu- 

tually independent with respective distributions M(M-^, T,/nx), A/"fMy , S/ny ), and Wp{u; S) 
where u = rix +nY — 2. Let A := My — My . With Sm '■= Sw/in-x + «y — 2), inference for A 
can be based on the pivot (Hotelling's two-sample T^): 

t2 :=(n-^-Hnyi)-i(Z-Z- A)S;:\F-Z- A)T (38) 

with distribution T^{p; u). In particular, a test of the hypothesis My = M y can be conducted 
by referring {{u ~ p + l)/p} (n^^ + ny-'-)-i(y - x)S^^{y_ - xf^ to tables of F^_^j^-^. 



6 Classification 



Suppose the random (p x 1) vector X of observations on variables V_ is known 
to come from one of two populations, Hi or 112, having distribution S) if 

from Ilfc, with density 

= (2x)4et2)i ^V-\(x-!^)^-\x-!^f. (39) 
For the moment we suppose all parameters known, and ^ fi^. 

We observe X = x and have to assign the observation to one of the two popula- 
tions. The likelihood ratio statistic is A = A(X) where 



P2[X 

A[X) = 

where 



A(x) = = exp|(x-7J)a| (40) 

pi{x) - 



a := S-i(/i2-/i^)T. (42) 

A likelihood ratio test (LRT) will thus be of the form: "assign to 112 if Xa > k" . 
The variable W := V_Si is the linear discriminant function (LD) for this problem. 

When k = fiw '■= 7^^; this test is equivalent to assigning to 112 if A > 1. By 
general properties of LRTs, this rule will minimise the sum a + /3 of the two 
types of error. By symmetry it will have a = (3, and will be minimax. If we 
have non-equal prior probabilities for the two hypotheses, or differing losses for 
the two types of error, a different cut-off will be optimal. 



6.1 Another interpretation 



Suppose we decide to collapse our data to the observation on just one variable, 
say U = Vc. Then, in population Uk, Xu '■= X c will have the univariate normal 
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distribution A/'(/i^c, c"'"Ec). The separation between the two distributions can 
reasonably be measured by the distance between their means, measured in units 
of their (common) standard deviation; or, essentially equivalent, of its square: 



,2 ._ {(/^2-/^l)^}^ 



(43) 



We might now try to choose c to maximise (i^. Since is unchanged on multiply- 
ing c by a scalar, we may as well impose the normalisation constraint c"'"Ec = 1, 
and subject to this choose c to maximise {(/i^ — /i_^)c}^. Using a Lagrange 
multiplier 7, we have to maximise c^l^fi^ — fJ'^)^{fJ'^ — /i^^) — 7E}c. This gives 

{(^2 ^ fii^^^fi2 ~ ^1) ~ ^^^^ = °' = ^"H(/^2 ~ /^J^K/fs ~ /fi)^' ^hich is 

a scalar multiple of (/i^ — fJ'-J'^- Hence c oc I]~^{fj,^ — ^_^)^ = a. That is, choosing 

the variable U to be (proportional to) the linear discriminant LD will achieve 

maximum separation between its two univariate distributions. 

This maximised separation is 

It is termed the Mahalanohis distance between the two multivariate distributions. 
Note that it is invariant under a non-singular transformation of the variables. 



6.2 Data 



What if we do not know the parameters jj,^,fi^,Ti7 If we have data from each 
population, as envisaged in ^ 15.21 we can simply replace these by their sample 
estimates Sw) and proceed as before. 

The sample counterpart of the Mahalanobis distance is (using notation similar 

to that of S I5.2|1 {x_2 — x.i)S^ (^2 ~ W.i)'^, a multiple of Hotelling's two-sample 
statistic for testing the hypothesis = Mg. Since this hypothesis is false, the 
distribution of is somewhat complex]^ although it depends only on D^. Note 
that, because the sample separation is optimised for the specific data at hand, it 
will be biased upwards as an estimate of the population separation D'^, seriously 
so unless u ^ p. 



^It is a non-central T^-distribution, which is a scaled version of the non-central F- 
distribution. See Anderson (2003), § 5.2.2. 
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7 Principal components 

Let X, with non-singular dispersion matrix S, be the observation on a set of p 
variables V_. By an abuse of language we may also term S the dispersion matrix 
of V_. Under a non-singular change of variables, = V_A has dispersion matrix 
S* = A^SA. If we know S, it can be helpful to make a change of variables such 
that S* is the identity, i.e. the different variables are uncorrelated (independent 
if jointly normal) with unit variance. This can be achieved by using any matrix A 
which is a square-root of i.e. AA"^ = S~^. There are many such square roots: 
for example, Gram-Schmidt orthogonalisation is equivalent to using the unique 
upper-triangular square root. Or we can reorder the variables before constructing 
this. 

One useful transformation to uncorrelated variables (though not necessarily with 
unit variance, which could however easily be arranged by further scaling of each 
individual variable) is based on the spectral decomposition of the non-negative 
definite symmetric matrix S, in terms of its eigenvalues and eigenvectors. General 
matrix theory tells us that there exists a diagonal matrix A = diag(Ai, . . . , Ap) 
with Ai > . . . > Ap > 0, and an orthogonal matrix A, such that we can write 

A^EA = A. (45) 

Then Aj is an eigenvalues of S, satisfying det(S — Xjlp) = 0, and the jth column 
of A, aj, is a corresponding unit-length eigenvector (unique, up to a sign change, 
if the multiplicity of Aj is unity), satisfying Saj = AjHj. 

The transformed variables {V{, . . . , V*) comprising ]/* := V_A are the principal 
components^ They are uncorrelated, with vaT{V*) = Aj. 

Definition 2 A variable W = V_sl is termed normalised if Yl^=i '^i = 1- '-' 

Note: This definition is dependent on the choice of basic variables V_. 

Theorem 12 Among all normalised variables, the choice W = V{ , the first 
principal component, has maximum variance. Among all normalised variables 
uncorrelated with {V{, . . . , V^) {1 < k < p), the choice W = V^^^ has maximum 
variance. 

Proof. We have Vsl = V_*h with b = A^a.. Since A is orthogonal, a^'^a = b^b. 
So W is normalised if and only if it has the form Yli^i^i* with ~ 



^If a certain eigenvalue is repeated, with multiplicity m > 1, the associated principal com- 
ponents are not uniquely defined, but the m-dimcnsional space they span will be. 
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(In particular, each V* is normalised.) Because the {V*) are uncorrelated with 
variances (Aj), the variance of W is '^^bfXi. This is a weighted average of the 
(Aj) with weights (6^) summing to 1. It is clear that it will be maximised when 
all the weight is on the largest value, which will hold for 6i = ±1, 6j = (z > 1) 
(essentially uniquely so if the first eigenvalue is of multiplicity 1): i.e., for W = 
V{. 

A variable W = V_*h will be uncorrelated with {V{, . . . , V^) if and only if 6j = 
for i = 1, . . . , /c. Its variance will then be J2i=k+i the same argument as 

above shows that this will be maximised, for W normalised, when W = □ 

The first principal component Vi is often interpreted as the most important con- 
structed variable, since (subject to normalisation) it varies the most. For certain 
purposes {e.g. predicting some other variable) we might consider discarding all 
other variables and only using V{. Similarly for k < p the first fc-principal com- 
ponents might be regarded as the most important set of k variables, and only 
these retained — thus achieving dimension reduction. However whether these 
aspirations are achieved is quite another matter. There is also the question of 
the choice of k: a common criterion is to fix vr G (0, 1) close to 1 and choose 
the minimum k for which Yli=i ■^i / Yl^=i — ^- (This quantity is sometimes, 
though not very appropriately, called the proportion of total variance explained 
by the first k principal components.) 

We have noted that the above construction is relative to a specific choice of basic 
variables V_. Starting with another set (even if they are scalar multiples of the 
originals) will change S, and typically produce different principal components (in 
the context of Theorem El it will involve a different understanding of what it 
means for a variable to be normalised). It is common first to scale each of the 
initial basic variables by its standard deviation, so that S becomes the correlation 
matrix, and to construct principal components from that. 



7.1 Data 

When we do not know S, we might replace it by a sample-based estimate such 
as S'^. With ~ Wp{u; E) the exact joint sampling distribution of the sample 
eigenvalues (Li . . . , Lp) and eigenvectors of S'^ can be found but is (unsurprisingly) 
complex. Asymptotically, log(Li/Aj) ~ A/'(0, 2/i^), independently. 

To investigate the possibility of reducing the dimension to k < p one might think 
of testing the hypothesis A^+i = . . . Ap = 0. However, if this is true then S is 
singular, with probability 1 X is confined to a A;- dimensional subspace, and the 
sample eigenvalues L^+i, . . . ,Lp will all be zero. Hence if we observe non-singular 
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S we know the hypothesis is false, and no test is called for. 

In summary, principal component analysis is best viewed as a descriptive, rather 
than an analytic, technique. 



